2D Multi-Person Human Pose Estimation Based on Deep Learning

Ziyang Wang

doi:10.54097/png1t871

Keywords:

Human pose estimation, Top-down approach, Bottom-up approach.

Abstract

Human pose estimation uses computer vision methods to automatically locate and identify the principal joints of the human body. In recent years, deep learning techniques for human pose estimation have advanced considerably. Among these, 2D human pose estimation is crucial because it forms the basis for 3D human pose estimation. 2D multi-person pose estimation methods fall into two broad categories: top-down and bottom-up. The top-down approach first detects each person in the input and then localizes key points for each detected person individually. Its advantages include high accuracy and suitability for scenes containing a single person or a small, dispersed group of people; its disadvantages include reliance on the person detector, high computational cost, and poor real-time performance. The bottom-up approach first detects all key points in the input and then groups them by their spatial relationships into complete skeletons, one per person. Its advantage is that it does not rely on person detection, which makes it suitable for crowded scenes with occlusion; its disadvantage is that key point grouping is complex and prone to mismatches. This paper compares the results of the two approaches on the COCO dataset and analyzes their performance in specific scenarios. It also discusses the open problems and future research directions in this field.
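The difference in the order of operations between the two families can be illustrated with a toy sketch. It is not taken from the paper: the coordinates, the `PERSON_BOXES` list standing in for a person detector's output, the `max_link` threshold, and the greedy nearest-neighbor grouping are all assumptions chosen for illustration. Real top-down systems run a single-person pose network on each detected crop, and real bottom-up systems use learned grouping cues rather than plain distance.

```python
from math import hypot

# Hypothetical toy scene: two people, each with a "head" and a "neck" key point.
# PERSON_BOXES stands in for the output of a person detector: (x1, y1, x2, y2).
PERSON_BOXES = [(0, 0, 10, 20), (30, 0, 40, 20)]
# KEYPOINTS stands in for image-wide key point detections, in arbitrary order.
KEYPOINTS = {
    "head": [(5, 2), (35, 3)],
    "neck": [(34, 8), (5, 7)],
}


def top_down(boxes, keypoints):
    """Detect people first, then localize key points inside each person box."""
    poses = []
    for x1, y1, x2, y2 in boxes:
        pose = {}
        for part, candidates in keypoints.items():
            # Stand-in for a single-person pose estimator applied to the crop.
            inside = [p for p in candidates
                      if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
            if inside:
                pose[part] = inside[0]
        poses.append(pose)
    return poses


def bottom_up(keypoints, max_link=15.0):
    """Detect all key points first, then group them by spatial proximity."""
    poses = [{"head": head} for head in keypoints["head"]]
    free_necks = list(keypoints["neck"])
    for pose in poses:
        if not free_necks:
            break
        hx, hy = pose["head"]
        # Greedy grouping: attach the nearest unassigned neck to this head.
        nearest = min(free_necks, key=lambda n: hypot(n[0] - hx, n[1] - hy))
        if hypot(nearest[0] - hx, nearest[1] - hy) <= max_link:
            pose["neck"] = nearest
            free_necks.remove(nearest)
    return poses


if __name__ == "__main__":
    print("top-down: ", top_down(PERSON_BOXES, KEYPOINTS))
    print("bottom-up:", bottom_up(KEYPOINTS))
```

In this well-separated scene both functions recover the same two skeletons. The sketch also hints at the trade-offs described above: `top_down` depends entirely on the boxes it is given, while `bottom_up` can attach a key point to the wrong person when people stand closer together than `max_link`.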
