YOLO-AlignRank: Cross-Scale Deformable Head and Rank-Consistent Loss for One-Stage Object Detection
DOI:
https://doi.org/10.54097/n0j5k987

Keywords:
Object Detection, Deformable Convolution, Multi-Scale Fusion, Quality-Aware Classification, NMS-free

Abstract
Single-stage object detectors dominate industrial deployment thanks to their end-to-end simplicity and low latency, yet a persistent bottleneck is the misalignment between classification confidence and localization quality. Because candidate ordering before and after NMS is largely driven by classification scores, high-IoU predictions are often suppressed while poorly localized but high-scoring boxes survive. Prior quality-aware classification methods partly mitigate this issue but stop short of a systematic solution that spans cross-scale representation and ranking consistency. Meanwhile, the YOLO family has advanced with decoupled heads, dynamic label assignment, and even NMS-free training, creating an opportunity to unify head structure, quality learning, and ranking constraints. We propose YOLO-AlignRank, which integrates two complementary innovations. First, the CSD-Head (Cross-Scale Sparse Deformable Head) augments a YOLO decoupled head with cross-scale sparse deformable sampling and a Bidirectional Cross-Scale Dynamic Module (BCDM) for feature fusion. A small set of learnable offsets samples key points from adjacent pyramid levels, inheriting the spatial adaptivity of DCNv2 and the sparse-attention spirit of Deformable DETR while preserving convolutional efficiency. BCDM then performs gated top-down and bottom-up feature fusion via lightweight dynamic convolution, achieving real-time multi-scale integration akin to PAN/BiFPN. Second, the RCQ-Loss (Rank-Consistent Quality Loss) extends quality-aware classification with intra-set list alignment and pairwise ranking regularization. For each ground-truth object g with candidate set S_g, RCQ-Loss aligns the distribution of classification scores with normalized localization qualities (proportional to IoU) within S_g and enforces a matching rank order. Concretely, a soft distribution-alignment term (softmax cross-entropy over S_g) and a pairwise hinge term ensure that high-IoU candidates receive higher scores than low-IoU ones, enforcing consistency in both the score distribution and the sorted order. Together, these components align confidence with IoU across scales and candidate sets, reduce pre- and post-NMS mis-ranking, and improve multi-scale detection accuracy while retaining YOLO-level real-time efficiency and end-to-end simplicity in practical deployments.
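The cross-scale sparse deformable sampling in the CSD-Head can be illustrated with a minimal PyTorch sketch. This is an illustrative reading of the abstract, not the paper's implementation: the module name CrossScaleSparseSample, the choice of K = 4 sampling points, equal channel widths across pyramid levels, and the normalized-coordinate offset parameterization are all assumptions. Each location predicts a small set of offsets plus DCNv2-style modulation gates, bilinearly samples those points from an adjacent pyramid level, and aggregates them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleSparseSample(nn.Module):
    """Sketch (assumed design): each location on level l predicts K offsets
    and modulation weights, samples K points from an adjacent level via
    bilinear interpolation, and aggregates the modulated samples."""

    def __init__(self, channels, k=4):
        super().__init__()
        self.k = k
        self.offset = nn.Conv2d(channels, 2 * k, 3, padding=1)  # (dx, dy) per point
        self.modul = nn.Conv2d(channels, k, 3, padding=1)       # per-point gate
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x, x_adj):
        b, c, h, w = x.shape
        off = self.offset(x).view(b, self.k, 2, h, w)            # learned offsets
        m = torch.sigmoid(self.modul(x)).view(b, self.k, 1, h, w)

        # Base sampling grid in normalized [-1, 1] coordinates of the
        # adjacent level, so the resolution gap between levels is transparent.
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1)                     # (h, w, 2)

        out = 0.0
        for i in range(self.k):
            # Offsets are predicted directly in normalized units for simplicity.
            grid = base + off[:, i].permute(0, 2, 3, 1)          # (b, h, w, 2)
            sampled = F.grid_sample(x_adj, grid, align_corners=True)
            out = out + m[:, i] * sampled                        # gated aggregation
        return self.proj(out)

For example, head = CrossScaleSparseSample(channels=64); head(p3, p4) with p3 of shape (2, 64, 32, 32) and a coarser p4 of shape (2, 64, 16, 16) returns a (2, 64, 32, 32) tensor, since grid_sample's normalized coordinates absorb the stride difference between pyramid levels.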
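The RCQ-Loss description is concrete enough to sketch for a single ground-truth object g: a softmax cross-entropy term aligning the candidate score distribution with IoU-proportional targets, plus a pairwise hinge term on mis-ordered pairs. The margin value, the weighting between the two terms, and the helper name rcq_loss are assumptions for illustration, not the paper's settings.

import torch
import torch.nn.functional as F

def rcq_loss(scores, ious, margin=0.1, lambda_rank=0.5):
    """RCQ-Loss sketch for one ground-truth object g.

    scores: (N,) raw classification logits of the candidates in S_g
    ious:   (N,) IoUs of those candidates with g
    """
    # Distribution alignment: softmax cross-entropy between the candidate
    # score distribution and IoU-proportional target qualities over S_g.
    p = F.softmax(scores, dim=0)
    q = ious / ious.sum().clamp(min=1e-8)
    l_align = -(q * p.clamp(min=1e-8).log()).sum()

    # Pairwise ranking: for every pair with iou_i > iou_j, require
    # score_i to exceed score_j by at least `margin` (hinge penalty).
    s_diff = scores.unsqueeze(1) - scores.unsqueeze(0)          # s_i - s_j
    higher = (ious.unsqueeze(1) > ious.unsqueeze(0)).float()    # iou_i > iou_j
    l_rank = (F.relu(margin - s_diff) * higher).sum() / higher.sum().clamp(min=1.0)

    return l_align + lambda_rank * l_rank

# Usage: scores = torch.randn(6, requires_grad=True); ious = torch.rand(6)
# rcq_loss(scores, ious).backward()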
References
[1] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-End Object Detection with Transformers," in Proc. European Conf. on Computer Vision (ECCV), 2020, pp. 213–229.
[2] H. Zhang, H. Wang, F. Dayoub, and N. Sunderhauf, "VarifocalNet: An IoU-Aware Dense Object Detector," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8514–8523.
[3] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO Series in 2021," arXiv:2107.08430, 2021.
[4] A. Wang, H. Chen, L. Liu, et al., "YOLOv10: Real-Time End-to-End Object Detection," arXiv:2405.14458, 2024.
[5] X. Zhu, H. Hu, S. Lin, and J. Dai, "Deformable ConvNets v2: More Deformable, Better Results," arXiv:1811.11168, 2018.
[6] X. Zhu, W. Su, L. Lu, et al., "Deformable DETR: Deformable Transformers for End-to-End Object Detection," in Proc. Int. Conf. on Learning Representations (ICLR), 2021.
[7] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path Aggregation Network for Instance Segmentation," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8759–8768.
[8] R. Hogan, "What is YOLOv10? An Architecture Deep Dive," Roboflow Blog, Jun. 2024.
[9] Z. Zheng, P. Wang, W. Liu, et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression," in Proc. AAAI Conf. on Artificial Intelligence, 2020, pp. 12993–13000.
[10] Z. Xu, C. Zhang, and Z. Li, "OASL: Orientation-Aware Adaptive Sampling Learning for Arbitrary-Oriented Object Detection," Expert Systems with Applications, vol. 238, p. 122242, 2024.
[11] K. Oksuz, B. C. Cam, E. Akbas, and S. Kalkan, "Rank & Sort Loss for Object Detection and Instance Segmentation," in Proc. IEEE/CVF Int. Conf. on Computer Vision (ICCV), 2021, pp. 2980–2989.
[12] H. Xu, X. Zhao, and W. Yu, "Adaptive Dynamic Non-Monotonic Focal IoU Loss for Object Detection," IEEE Access, vol. 12, pp. 105679–105692, 2024.
[13] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," arXiv:1612.03144, 2016.
[14] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path Aggregation Network for Instance Segmentation," arXiv:1803.01534, 2018.
[15] Code Huddle, "Improving Instance Segmentation Using Path Aggregation Network," Medium, Oct. 2019.
[16] Y. Cao, K. Chen, C. C. Loy, and D. Lin, "Prime Sample Attention in Object Detection," arXiv:1904.04821, 2019.
[17] X. Zhu, H. Hu, S. Lin, and J. Dai, "Deformable ConvNets v2: More Deformable, Better Results," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9308–9316.
[18] B. Yang, G. Bender, Q. V. Le, and J. Ngiam, "CondConv: Conditionally Parameterized Convolutions for Efficient Inference," in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
License
Copyright (c) 2025 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.