Lang2Vision Diffusion: Language-Driven Diffusion for Robotic Action Frame Prediction

Authors

  • Guanyu Chen

DOI:

https://doi.org/10.54097/pw7r5392

Keywords:

Robotic Vision, Prediction, Diffusion Model

Abstract

We address the challenge of enabling robots to predict the visual outcomes of their actions through Lang2Vision Diffusion (L2V-Diff), a novel adaptation of InstructPix2Pix for robotic action frame prediction. Our framework takes an initial RGB observation paired with a natural language instruction and generates photorealistic images of anticipated future states via vision-language conditioned diffusion. Fine-tuned on synthetic RoboTwin data (300 episodes spanning hammering, handover, and stacking tasks), the method achieves strong quantitative performance: a mean Structural Similarity (SSIM) of 0.971 and a Peak Signal-to-Noise Ratio (PSNR) of 37.1 dB. By bridging high-level instructions and pixel-accurate visual prediction, L2V-Diff advances safety-critical robotic applications while eliminating the need for explicit 3D reconstruction or physics simulation.
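For concreteness, the sketch below shows how an InstructPix2Pix-style prediction step of this kind can be run with the Hugging Face diffusers library, followed by the SSIM/PSNR scoring used in the evaluation above. It is a minimal illustration, not the paper's implementation: it loads the public base InstructPix2Pix checkpoint (timbrooks/instruct-pix2pix) rather than the fine-tuned L2V-Diff weights, and the file names, instruction text, and resolution are illustrative assumptions.

    # Minimal sketch: language-conditioned future-frame prediction with an
    # InstructPix2Pix-style diffusion pipeline, plus SSIM/PSNR scoring.
    # Uses the public base checkpoint; the paper's fine-tuned weights differ.
    import torch
    import numpy as np
    from PIL import Image
    from diffusers import StableDiffusionInstructPix2PixPipeline
    from skimage.metrics import structural_similarity, peak_signal_noise_ratio

    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
    ).to("cuda")

    # Initial RGB observation and a natural-language action instruction
    # (file names and instruction text are hypothetical).
    obs = Image.open("observation.png").convert("RGB").resize((512, 512))
    instruction = "pick up the hammer and strike the nail"

    # Denoise toward the anticipated future frame, conditioned jointly on
    # the observation image and the instruction text.
    pred = pipe(
        instruction,
        image=obs,
        num_inference_steps=20,
        guidance_scale=7.5,        # strength of text conditioning
        image_guidance_scale=1.5,  # strength of image conditioning
    ).images[0]
    pred.save("predicted_frame.png")

    # Score the prediction against the recorded ground-truth future frame.
    gt = Image.open("ground_truth_frame.png").convert("RGB").resize((512, 512))
    gt_np, pred_np = np.asarray(gt), np.asarray(pred)
    ssim = structural_similarity(gt_np, pred_np, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt_np, pred_np)  # in dB
    print(f"SSIM: {ssim:.3f}  PSNR: {psnr:.1f} dB")

Note that image_guidance_scale is the knob specific to InstructPix2Pix-style models: it controls how closely the generated frame stays anchored to the input observation, as opposed to guidance_scale, which weights the language instruction.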




Published

29-12-2025


How to Cite

Chen, G. (2025). Lang2Vision Diffusion: Language-Driven Diffusion for Robotic Action Frame Prediction. Frontiers in Computing and Intelligent Systems, 14(3), 24-28. https://doi.org/10.54097/pw7r5392