Multi-scale Feature Extraction and Dynamic Aggregation for Referring Image Segmentation

Authors

  • Xinzhuo Gao
  • Quange Tan
  • Siyu Meng
  • Mingjie Wang
  • Rong Wang

DOI:

https://doi.org/10.54097/emt8dy47

Keywords:

Referring Image Segmentation, Multi-scale Feature, Feature Extraction, Attention Mechanism, Dynamic Aggregation

Abstract

Referring image segmentation aims to accurately locate and segment the target region in an image that a textual description refers to. Existing methods face three main challenges. First, information interaction is insufficient during multi-scale visual feature extraction, making it difficult to balance local details and global semantics. Second, cross-modal semantic alignment lacks a bidirectional guidance mechanism, resulting in weak semantic consistency between image and text features. Third, the feature aggregation strategy lacks dynamic adaptability and cannot flexibly adjust feature weights according to scene complexity, leading to insufficient feature discriminability. To address these issues, a referring image segmentation method based on a chained collaborative mechanism for multi-scale feature extraction and dynamic aggregation is proposed. First, a multi-scale adaptive feature fusion module is constructed. The image features after patch embedding are split into four channel sub-spaces through group convolution. After multi-scale features are extracted via adaptive max-pooling, a gating network dynamically adjusts the fusion weights to realize adaptive interaction between local details and global semantics, improving the diversity of feature expression. Second, a dual-attention modal alignment module is designed. Channel-first convolutional attention is applied to the image features: under the guidance of the text features, it adjusts the weights of different channels and spatial positions in the feature map to strengthen the expression of key regions. Efficient channel attention is applied to the text features: taking the image features as a prior, it enhances the semantic relevance of the text through channel interaction, improving the accuracy and efficiency of feature alignment. Finally, a dynamic feature aggregation module is proposed. It explores the non-linear dependencies between channels, generates fine-grained feature weights, and dynamically adjusts the features of each scale through competitive weight allocation, realizing complementary aggregation of global and local features and improving feature discriminability. Experimental results show that the proposed method achieves Intersection over Union (IoU) scores of 74.34% and 66.59% on testA and testB of the RefCOCO dataset, 68.99% and 53.87% on testA and testB of the RefCOCO+ dataset, and 61.65% on the test set of the G-Ref dataset, verifying its effectiveness.
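For readers unfamiliar with the metric reported above, the following is a minimal Python sketch of how IoU can be computed for binary segmentation masks. It is an illustration only, not the authors' evaluation code. The abstract does not state whether the reported figures are overall IoU (total intersection divided by total union across the test set) or mean IoU (the average of per-sample IoU), so both aggregates are shown. The function names, the nested-list mask representation, and the convention that two empty masks score 1.0 are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Assumed representation: a binary mask is a list of rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def iou(pred: Mask, target: Mask) -> float:
    """Per-sample IoU; two empty masks are treated as a perfect match (assumption)."""
    inter, union = intersection_and_union(pred, target)
    return inter / union if union else 1.0


def mean_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Average of the per-sample IoU values."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


def overall_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Total intersection over total union, accumulated across all samples."""
    total_inter = 0
    total_union = 0
    for p, t in pairs:
        inter, union = intersection_and_union(p, t)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Toy example with two prediction/ground-truth pairs.
pred_a = [[1, 1, 0], [0, 1, 0]]
target_a = [[1, 0, 0], [0, 1, 1]]
pred_b = [[1, 1]]
target_b = [[1, 1]]
pairs = [(pred_a, target_a), (pred_b, target_b)]

print(iou(pred_a, target_a))  # 2 shared pixels out of 4 in the union
print(mean_iou(pairs))
print(overall_iou(pairs))
```

The two aggregates generally differ: overall IoU weights large objects more heavily, while mean IoU treats every sample equally, which is why it matters which one a benchmark reports.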

Published

04-03-2026

Section

Articles

How to Cite

Gao, X., Tan, Q., Meng, S., Wang, M., & Wang, R. (2026). Multi-scale Feature Extraction and Dynamic Aggregation for Referring Image Segmentation. Frontiers in Computing and Intelligent Systems, 15(2), 46-56. https://doi.org/10.54097/emt8dy47