Attention-Refined Two-Branch Networks for Real-Time Semantic Segmentation

Authors

  • Shize Xu
  • Yongsheng Dong

DOI:

https://doi.org/10.54097/dvt0cw52

Keywords:

Real-time Semantic Segmentation, Dual Attention, Two-branch.

Abstract

In scenarios with strict real-time requirements, such as autonomous driving, real-time semantic segmentation is becoming increasingly important. BiSeNetV2 has been shown to be an effective model, but its room for further speed-up is limited when high accuracy must be maintained. Furthermore, combining high-level semantic information with detail information causes a loss of detail in the feature maps, a problem that matters particularly for real-time semantic segmentation tasks. In this paper, an efficient Attention Refined Two-Branch Real-Time Semantic Segmentation Network (ARTRNet) is designed to alleviate these challenges. Specifically, the network adopts a two-branch structure: a spatial detail branch and a lightweight densely connected context refinement branch. The context refinement branch is composed of a novel downsampling module (DSM) and a lightweight dense feature module, which together reduce computational cost and model size. In addition, the attention vector of each feature map is computed by an Attention Refinement Module (ARM) with a residual connection to highlight the features. A low-resolution context aggregation module (LRCAM) consisting of lightweight Ghost modules is also proposed to enhance the spatial information processing capability of the context refinement branch. In the final fusion stage, the Deformed Convolutional Attention Refinement Fusion Module (DCARFM) is proposed, which applies the attention refinement operation to each of the two branches separately, enhancing the feature expression of each branch and improving the final segmentation results. Finally, experiments on the Cityscapes and CamVid datasets show that ARTRNet achieves a good balance between segmentation accuracy and inference speed. On the Cityscapes dataset, ARTRNet achieves 75.7% mIoU at 132 FPS, and 76.9% mIoU at 96 FPS on higher-resolution images.
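The abstract does not spell out the internals of the ARM, so the following is only a minimal sketch of one plausible reading: channel attention in the style of the BiSeNet ARM (global average pooling, a 1x1 convolution, a sigmoid) followed by a residual connection that adds the input back. The function name `attention_refine`, the `weights` and `bias` parameters, and the omission of batch normalization are assumptions made for illustration, not the authors' implementation.

```python
import math


def attention_refine(features, weights, bias):
    """Channel attention with a residual link (illustrative sketch only).

    features: list of C channels, each an H x W list of lists of floats.
    weights:  C x C matrix standing in for a 1x1 convolution (hypothetical).
    bias:     list of C floats (hypothetical).
    Returns the attention vector and the refined feature maps.
    """
    # Global average pooling: one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # A 1x1 convolution applied to the pooled vector is a matrix-vector product.
    logits = [
        sum(w * p for w, p in zip(w_row, pooled)) + b
        for w_row, b in zip(weights, bias)
    ]
    # The sigmoid maps each logit to an attention weight in (0, 1).
    attention = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Reweight every channel by its attention weight, then add the input
    # back (the residual link), so no channel is suppressed entirely.
    refined = [
        [[x * a + x for x in row] for row in channel]
        for channel, a in zip(features, attention)
    ]
    return attention, refined


# Two 2x2 channels; zero weights and biases give an attention of 0.5 everywhere.
example_features = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
example_weights = [[0.0, 0.0], [0.0, 0.0]]
example_bias = [0.0, 0.0]
attention, refined = attention_refine(example_features, example_weights, example_bias)
```

With the residual link, each output value is `x * (1 + a)` rather than `x * a`, so the attention vector can emphasize a channel without ever zeroing it out.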

Published

01-12-2024

Section

Articles

How to Cite

Xu, S., & Dong, Y. (2024). Attention-Refined Two-Branch Networks for Real-Time Semantic Segmentation. Academic Journal of Science and Technology, 13(2), 266-275. https://doi.org/10.54097/dvt0cw52