Research on Robot Visual Perception and Object Recognition Based on Deep Learning
DOI: https://doi.org/10.54097/rhwged92

Keywords: Robot Visual Perception, Object Recognition, Deep Learning, Small and Medium-Sized Scenarios, YOLOv8-nano, Embedded Deployment

Abstract
Robots operating in small and medium-sized scenarios (such as light-industrial sorting and desktop manipulation) face visual challenges including occlusion, lighting fluctuations, and demanding positioning-accuracy requirements, while traditional methods and general-purpose deep learning models struggle to balance robustness and performance. This study proposes a solution that integrates dataset optimization, model improvement, and embedded deployment. A hybrid dataset (8 categories, >5,000 samples) was constructed from curated COCO data (standardized in style and size) and self-collected images (annotation accuracy ≥98%). YOLOv8-nano was enhanced with a squeeze-and-excitation (SE) attention module and combined with gamma correction and few-shot fine-tuning. The results show an average mAP >78% (≥72% under occlusion and lighting fluctuations, an 8%–10% improvement over the baseline) and a positioning error ≤6 mm. Deployment on a Raspberry Pi 4B with INT8 quantization achieved ≥22 FPS. The study is limited by the small number of categories and the absence of dynamic testing; future work will expand the dataset and add tracking capabilities.
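The abstract mentions gamma correction as a preprocessing step for handling lighting fluctuations; the paper's own implementation is not shown here, but a minimal sketch of standard lookup-table gamma correction (the gamma value of 1.5 is an illustrative assumption, not a parameter reported by the study) might look like:

```python
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Apply gamma correction to an 8-bit image via a 256-entry lookup table.

    gamma > 1 brightens dark regions (useful under dim or uneven lighting);
    gamma < 1 darkens an over-exposed image. The table is precomputed once,
    so the per-pixel cost is a single array lookup.
    """
    # Map each of the 256 intensity levels through the power-law curve.
    table = np.array(
        [((i / 255.0) ** (1.0 / gamma)) * 255.0 for i in range(256)]
    ).astype(np.uint8)
    return table[image]
```

Precomputing the lookup table keeps the operation cheap enough for frame-rate preprocessing on an embedded board such as the Raspberry Pi 4B.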
License
Copyright (c) 2026 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

