A Review of Text-Based Pedestrian Retrieval Methods

Authors

  • Xiangmian Qiu
  • Yali Zhang
  • Yichen Zhao
  • Jinzhao Li

DOI:

https://doi.org/10.54097/3nkp6698

Keywords:

Pedestrian Retrieval, Cross-modal Retrieval, Image Text Matching, Visual Language Pre-training Models

Abstract

Text-based pedestrian retrieval uses a textual description as the query to retrieve matching pedestrians from an image gallery, a capability that is crucial for public security and investigative work. By surveying the relevant literature, we summarize the main methods in this field and group them into five categories: methods based on feature matching, multi-granularity information, adversarial learning, cross-modal attention, and vision-language pre-training (VLP) models. For each category, we compare and analyze the design ideas, characteristics, advantages, and disadvantages of its classical models. We compare the performance of each model on the commonly used datasets and evaluation metrics (TPR, mAP) for this task, discuss the open problems in the field, and outline future development trends. Recently, VLP-based methods have become mainstream; they achieve higher retrieval precision but still suffer from large parameter counts and difficult training, so lightweight solutions need to be explored in the future.

Published

29-12-2024

Section

Articles

How to Cite

Qiu, X., Zhang, Y., Zhao, Y., & Li, J. (2024). A Review of Text-Based Pedestrian Retrieval Methods. Frontiers in Computing and Intelligent Systems, 10(3), 118-127. https://doi.org/10.54097/3nkp6698