Deep Reinforcement Learning: From Single-Agent to Multi-Agent
DOI: https://doi.org/10.54097/my2bvn05

Keywords: Reinforcement Learning; Actor-Critic; Multi-Agent; Sim-to-Real; Self-Play.

Abstract
Deep Reinforcement Learning (DRL), a core branch of artificial intelligence, integrates the representational power of deep learning with the decision-optimization mechanisms of reinforcement learning, enabling breakthroughs in complex tasks ranging from virtual environments to real-world scenarios. This paper systematically organizes the core algorithmic landscape of deep reinforcement learning into four classic frameworks: Value-Based, Policy-Based, Actor-Critic, and Multi-Agent Reinforcement Learning (MARL), and analyzes the core ideas, improvement logic, and applicable scenarios of each algorithm. Focusing on the latest research findings of 2025, it then examines breakthroughs of reinforcement learning in directions such as real-world physics alignment, self-play-driven robustness, and autonomous driving, revealing the technical path by which algorithms evolve from "virtual optimization" to "real-world reliability". It also discusses remaining challenges, such as sample inefficiency and safety, and points out key directions for further enhancing DRL's real-world applicability. This review provides a systematic reference for related research.
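To make the actor-critic framework named in the taxonomy above concrete, here is a minimal illustrative sketch: an actor (softmax policy) updated by the policy gradient, with a critic (value baseline) that reduces the variance of that gradient. The two-armed bandit environment, learning rates, and arm means are assumptions chosen for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)                 # actor: preferences over two actions
v = 0.0                             # critic: value baseline (single state)
true_means = np.array([0.2, 0.8])   # action 1 is better in expectation

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                 # sample action from policy
    r = true_means[a] + rng.normal(0.0, 0.1)   # noisy reward
    advantage = r - v                          # critic's advantage estimate
    v += 0.05 * advantage                      # critic update toward reward
    grad = -probs
    grad[a] += 1.0                             # grad of log pi(a | theta)
    theta += 0.1 * advantage * grad            # actor update (policy gradient)

print(softmax(theta))  # policy should come to favor action 1
```

The same actor/critic split scales up to the deep variants surveyed in the paper (DDPG, SAC, TD3, MADDPG), where the tabular preferences and scalar baseline are replaced by neural networks.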
License
Copyright (c) 2026 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

