Research and Analysis of Robot Grasping Deep Learning: From End-to-End Model to Pre Training Foundation

Pengshu Ma

doi:10.54097/0exfh648

Keywords:

Robotic Grasping, Deep Learning, End-to-End Learning, Multi-modal Fusion, Sim-to-Real Transfer.

Abstract

Robotic grasping, a cornerstone of autonomous manipulation, has been transformed by deep learning. This survey reviews the current state of deep learning-based methods for robotic grasping, highlighting the shift from traditional multi-stage pipelines to data-driven approaches, with end-to-end learning as a core branch. It systematically categorizes and analyzes the key methodologies: end-to-end grasp estimation (covering both planar and 6-DoF spatial grasping), multi-modal fusion (RGB-D, vision-language), reinforcement and imitation learning, and the emerging application of large-scale pre-trained models. The review synthesizes findings from prominent datasets (e.g., Cornell, GraspNet-1Billion) and evaluates methods against core metrics such as grasp success rate, inference time, and generalization ability. It also emphasizes practical applicability, linking these technologies to specific real-world scenarios such as industrial bin-picking and domestic service tasks. Despite significant progress, critical challenges persist, including the sim-to-real gap, limited generalization, and the trade-off between real-time performance and computational cost. The survey discusses these open challenges and outlines promising, actionable future directions, such as domain adaptation for sim-to-real transfer and model lightweighting for edge deployment, to bridge the gap between academic research and industrial deployment.
