YOLO-AlignRank: Cross-Scale Deformable Head and Rank-Consistent Loss for One-Stage Object Detection
DOI:
https://doi.org/10.54097/n0j5k987

Keywords:
Object Detection, Deformable Convolution, Multi-Scale Fusion, Quality-Aware Classification, NMS-free

Abstract
Single-stage object detectors dominate industrial deployment thanks to their end-to-end simplicity and low latency, yet a persistent bottleneck is the misalignment between classification confidence and localization quality. Because candidate ordering before and after NMS is largely driven by classification scores, high-IoU predictions are often suppressed while poorly localized but high-scoring boxes survive. Prior quality-aware classification methods partly mitigate this issue but stop short of a systematic solution that spans cross-scale representation and ranking consistency. Meanwhile, the YOLO family has advanced with decoupled heads, dynamic label assignment, and even NMS-free training, creating an opportunity to unify head structure, quality learning, and ranking constraints. We propose YOLO-AlignRank, which integrates two complementary innovations. First, the CSD-Head (Cross-Scale Sparse Deformable Head) augments a YOLO decoupled head with cross-scale sparse deformable sampling and a Bidirectional Cross-Scale Dynamic Module (BCDM) for feature fusion. A small set of learnable offsets samples key points from adjacent pyramid levels, inheriting the spatial adaptivity of DCNv2 and the sparse-attention spirit of Deformable DETR while preserving convolutional efficiency. BCDM then performs gated top-down and bottom-up feature fusion via lightweight dynamic convolution, achieving real-time multi-scale integration akin to PAN/BiFPN. Second, the RCQ-Loss (Rank-Consistent Quality Loss) extends quality-aware classification with intra-set list alignment and pairwise ranking regularization. For each ground-truth object g with candidate set S_g, RCQ-Loss aligns the distribution of classification scores with normalized localization qualities (proportional to IoU) within S_g and enforces a matching rank order. Concretely, a soft distribution-alignment term (softmax cross-entropy over S_g) and a pairwise hinge term ensure that high-IoU candidates receive higher scores than low-IoU ones, enforcing consistency in both the score distribution and the sorted order. Together, these components align confidence with IoU across scales and candidate sets, reduce pre- and post-NMS mis-ranking, and improve multi-scale detection accuracy while retaining YOLO-level real-time efficiency and end-to-end simplicity in practical deployments.
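The cross-scale sparse deformable sampling in the CSD-Head can be illustrated with a minimal PyTorch sketch. This is an illustrative reading of the abstract, not the paper's implementation: the module name CrossScaleSparseSample, the choice of K = 4 sampling points, equal channel widths across pyramid levels, and the normalized-coordinate offset parameterization are all assumptions. Each location predicts a small set of offsets plus DCNv2-style modulation gates, bilinearly samples those points from an adjacent pyramid level, and aggregates them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleSparseSample(nn.Module):
    """Sketch (assumed design): each location on level l predicts K offsets
    and modulation weights, samples K points from an adjacent level via
    bilinear interpolation, and aggregates the modulated samples."""

    def __init__(self, channels, k=4):
        super().__init__()
        self.k = k
        self.offset = nn.Conv2d(channels, 2 * k, 3, padding=1)  # (dx, dy) per point
        self.modul = nn.Conv2d(channels, k, 3, padding=1)       # per-point gate
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x, x_adj):
        b, c, h, w = x.shape
        off = self.offset(x).view(b, self.k, 2, h, w)            # learned offsets
        m = torch.sigmoid(self.modul(x)).view(b, self.k, 1, h, w)

        # Base sampling grid in normalized [-1, 1] coordinates of the
        # adjacent level, so the resolution gap between levels is transparent.
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1)                     # (h, w, 2)

        out = 0.0
        for i in range(self.k):
            # Offsets are predicted directly in normalized units for simplicity.
            grid = base + off[:, i].permute(0, 2, 3, 1)          # (b, h, w, 2)
            sampled = F.grid_sample(x_adj, grid, align_corners=True)
            out = out + m[:, i] * sampled                        # gated aggregation
        return self.proj(out)

For example, head = CrossScaleSparseSample(channels=64); head(p3, p4) with p3 of shape (2, 64, 32, 32) and a coarser p4 of shape (2, 64, 16, 16) returns a (2, 64, 32, 32) tensor, since grid_sample's normalized coordinates absorb the stride difference between pyramid levels.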
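The RCQ-Loss description is concrete enough to sketch for a single ground-truth object g: a softmax cross-entropy term aligning the candidate score distribution with IoU-proportional targets, plus a pairwise hinge term on mis-ordered pairs. The margin value, the weighting between the two terms, and the helper name rcq_loss are assumptions for illustration, not the paper's settings.

import torch
import torch.nn.functional as F

def rcq_loss(scores, ious, margin=0.1, lambda_rank=0.5):
    """RCQ-Loss sketch for one ground-truth object g.

    scores: (N,) raw classification logits of the candidates in S_g
    ious:   (N,) IoUs of those candidates with g
    """
    # Distribution alignment: softmax cross-entropy between the candidate
    # score distribution and IoU-proportional target qualities over S_g.
    p = F.softmax(scores, dim=0)
    q = ious / ious.sum().clamp(min=1e-8)
    l_align = -(q * p.clamp(min=1e-8).log()).sum()

    # Pairwise ranking: for every pair with iou_i > iou_j, require
    # score_i to exceed score_j by at least `margin` (hinge penalty).
    s_diff = scores.unsqueeze(1) - scores.unsqueeze(0)          # s_i - s_j
    higher = (ious.unsqueeze(1) > ious.unsqueeze(0)).float()    # iou_i > iou_j
    l_rank = (F.relu(margin - s_diff) * higher).sum() / higher.sum().clamp(min=1.0)

    return l_align + lambda_rank * l_rank

# Usage: scores = torch.randn(6, requires_grad=True); ious = torch.rand(6)
# rcq_loss(scores, ious).backward()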
References
[1] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-End Object Detection with Transformers," in Proc. European Conf. on Computer Vision (ECCV), 2020, pp. 213–229.
[2] H. Zhang, H. Wang, F. Dayoub, and N. Sunderhauf, "VarifocalNet: An IoU-Aware Dense Object Detector," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8514–8523.
[3] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO Series in 2021," arXiv:2107.08430, 2021.
[4] A. Wang, H. Chen, L. Liu, et al., "YOLOv10: Real-Time End-to-End Object Detection," arXiv:2405.14458, 2024.
[5] X. Zhu, H. Hu, S. Lin, and J. Dai, "Deformable ConvNets v2: More Deformable, Better Results," arXiv:1811.11168, 2018.
[6] X. Zhu, W. Su, L. Lu, et al., "Deformable DETR: Deformable Transformers for End-to-End Object Detection," in Proc. Int. Conf. on Learning Representations (ICLR), 2021.
[7] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path Aggregation Network for Instance Segmentation," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8759–8768.
[8] R. Hogan, "What is YOLOv10? An Architecture Deep Dive," Roboflow Blog, Jun. 2024.
[9] Z. Zheng, P. Wang, W. Liu, et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression," in Proc. AAAI Conf. on Artificial Intelligence, 2020, pp. 12993–13000.
[10] Z. Xu, C. Zhang, and Z. Li, "OASL: Orientation-Aware Adaptive Sampling Learning for Arbitrary-Oriented Object Detection," Expert Systems with Applications, vol. 238, p. 122242, 2024.
[11] K. Oksuz, B. C. Cam, E. Akbas, and S. Kalkan, "Rank & Sort Loss for Object Detection and Instance Segmentation," in Proc. IEEE/CVF Int. Conf. on Computer Vision (ICCV), 2021, pp. 2980–2989.
[12] H. Xu, X. Zhao, and W. Yu, "Adaptive Dynamic Non-Monotonic Focal IoU Loss for Object Detection," IEEE Access, vol. 12, pp. 105679–105692, 2024.
[13] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," arXiv:1612.03144, 2016.
[14] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path Aggregation Network for Instance Segmentation," arXiv:1803.01534, 2018.
[15] Code Huddle, "Improving Instance Segmentation Using Path Aggregation Network," Medium, Oct. 2019.
[16] Y. Cao, K. Chen, C. C. Loy, and D. Lin, "Prime Sample Attention in Object Detection," arXiv:1904.04821, 2019.
[17] X. Zhu, H. Hu, S. Lin, and J. Dai, "Deformable ConvNets v2: More Deformable, Better Results," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9308–9316.
[18] B. Yang, G. Bender, Q. V. Le, and J. Ngiam, "CondConv: Conditionally Parameterized Convolutions for Efficient Inference," in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
License
Copyright (c) 2025 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.