Referring Image Segmentation via Register-Aware Feature Selection and Adaptive Adaptation

Authors

  • Xiaozhen Gao

DOI:

https://doi.org/10.54097/perg7629

Keywords:

Referring Image Segmentation; Vision Foundation Model; DINOv3; Parameter-Efficient Fine-Tuning; Register Tokens; Cross-Modal Fusion

Abstract

Referring Image Segmentation (RIS) aims to segment the target region of an image at the pixel level, given a natural language description. This paper proposes RAFS (Register-Aware Feature Selection and Adaptation), a parameter-efficient framework built upon DINOv3. It introduces three lightweight adapter modules: MLFS, which adaptively aggregates multi-level features via learnable Gaussian weighting; SCSA, which performs efficient cross-modal fusion using depthwise separable convolutions and cross-modal attention; and RAFF, which uses the global context carried by register tokens to enhance local features. With only 6.31M additional parameters, RAFS achieves competitive performance on the RefCOCO, RefCOCO+, and G-Ref benchmarks.
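The abstract does not give the exact MLFS formulation, so the following is only a minimal sketch of one plausible reading of "learnable Gaussian weighting" over backbone layers. It assumes each layer index receives a weight from a Gaussian with center `mu` and width `sigma`, normalized to sum to one. The function names, the plain-float parameters, and the list-based features are all hypothetical and chosen for illustration; in a real model `mu` and `sigma` would be trainable parameters and the features would be tensors.

```python
import math


def gaussian_layer_weights(num_layers, mu, sigma):
    """Return normalized Gaussian weights over layer indices 0..num_layers-1.

    mu and sigma stand in for learnable parameters (an assumption; here
    they are plain floats).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    raw = [math.exp(-((i - mu) ** 2) / (2.0 * sigma ** 2))
           for i in range(num_layers)]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate_layers(layer_features, mu, sigma):
    """Weighted sum of per-layer feature vectors (each a list of floats)."""
    weights = gaussian_layer_weights(len(layer_features), mu, sigma)
    dim = len(layer_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_features))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Four hypothetical layers, each yielding a 2-dimensional feature.
    feats = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
    print(gaussian_layer_weights(len(feats), mu=2.0, sigma=1.0))
    print(aggregate_layers(feats, mu=2.0, sigma=1.0))
```

Under this reading, shifting `mu` moves the emphasis toward shallower or deeper layers, and `sigma` controls how many neighboring layers contribute, so only two scalars are needed to select among all levels.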

Published

30-04-2026

Section

Articles

How to Cite

Gao, X. (2026). Referring Image Segmentation via Register-Aware Feature Selection and Adaptive Adaptation. Mathematical Modeling and Algorithm Application, 8(3), 123-135. https://doi.org/10.54097/perg7629