Referring Image Segmentation via Register-Aware Feature Selection and Adaptive Adaptation
DOI:
https://doi.org/10.54097/perg7629
Keywords:
Referring Image Segmentation; Vision Foundation Model; DINOv3; Parameter-Efficient Fine-Tuning; Register Tokens; Cross-Modal Fusion
Abstract
Referring Image Segmentation (RIS) aims to segment the target region of an image at the pixel level, given a natural language description of that region. This paper proposes RAFS (Register-Aware Feature Selection and Adaptation), a parameter-efficient framework built upon DINOv3. It introduces three lightweight adapter modules: MLFS, which adaptively aggregates multi-level features via learnable Gaussian weighting; SCSA, which performs efficient cross-modal fusion via depthwise separable convolutions and cross-modal attention; and RAFF, which leverages the global context of register tokens to enhance local features. With only 6.31M additional parameters, RAFS achieves competitive performance on the RefCOCO, RefCOCO+, and G-Ref benchmarks.
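As a rough illustration of what "learnable Gaussian weighting" over backbone layers could look like, the sketch below weights each layer's feature vector by a normalized Gaussian of its layer index and sums the results. This is an assumption-laden toy, not the paper's MLFS module: the function names `gaussian_layer_weights` and `aggregate`, the parameters `mu` and `sigma`, and the toy features are all hypothetical, and in a trainable version `mu` and `sigma` would presumably be learned rather than fixed floats.

```python
import math

def gaussian_layer_weights(num_layers, mu, sigma):
    """Return normalized Gaussian weights over layer indices 0..num_layers-1.

    mu (center layer) and sigma (spread) are plain floats here; they stand in
    for parameters that a real module would learn (assumption).
    """
    raw = [math.exp(-((l - mu) ** 2) / (2.0 * sigma ** 2)) for l in range(num_layers)]
    total = sum(raw)
    return [r / total for r in raw]

def aggregate(features, weights):
    """Weighted sum of per-layer feature vectors (equal-length lists of floats)."""
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

# Toy input: 4 hypothetical layers, each giving a 3-dimensional feature vector.
features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
weights = gaussian_layer_weights(num_layers=4, mu=2.0, sigma=1.0)
fused = aggregate(features, weights)
```

With `mu=2.0`, the layer at index 2 receives the largest weight and its neighbors at indices 1 and 3 receive equal, smaller weights; shifting `mu` or widening `sigma` changes which levels dominate the aggregated feature.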
License
Copyright (c) 2026 Xiaozhen Gao

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







