A Survey of Multi-Armed Bandit Algorithms: From Theoretical Foundations to Modern Applications
DOI:
https://doi.org/10.54097/qtvt0603
Keywords:
Multi-armed bandit, Reinforcement learning, Exploration-exploitation, Thompson sampling, Upper Confidence Bound.
Abstract
The multi-armed bandit (MAB) problem provides a canonical formulation of sequential decision-making under uncertainty, capturing the exploration-exploitation trade-off. This survey charts the intellectual lineage of MAB algorithms, from their statistical roots to their contemporary role in modern artificial intelligence. We begin by laying the mathematical groundwork for the problem in statistical terms. Building on this, we survey algorithms for the classical stochastic setting, ranging from simple heuristics such as ε-greedy to principled approaches based on optimism (Upper Confidence Bound, UCB) and Bayesian inference (Thompson Sampling). We then discuss the contextual bandit setting, in which side information is introduced to enable large-scale personalization, and conclude with recent trends in bandit research, including adaptations to real-world constraints, non-stationary environments, and the emerging frontier of fairness considerations. Overall, this review aims to showcase the broad applicability of bandit algorithms to problems ranging from recommendation systems to clinical trials, and to argue for the continuing relevance of bandit methods.
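To make the three classical strategies named above concrete, the sketch below gives one plausible implementation of each arm-selection rule for a Bernoulli bandit. This is a minimal illustrative sketch, not code from the survey: the Beta(1, 1) priors, the UCB1 bonus form (which assumes rewards in [0, 1]), and the toy arm probabilities are all assumptions made for the example.

```python
import math
import random

def eps_greedy(means, eps=0.1):
    """epsilon-greedy: explore uniformly at random with probability eps,
    otherwise pull the arm with the highest empirical mean."""
    if random.random() < eps:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

def ucb1(counts, means, t):
    """UCB1: optimism in the face of uncertainty via a confidence bonus
    that shrinks as an arm is sampled more often."""
    for a, n in enumerate(counts):
        if n == 0:  # play every arm once before using the bonus
            return a
    return max(range(len(means)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))

def thompson(successes, failures):
    """Thompson sampling for Bernoulli rewards: draw a mean from each
    arm's Beta posterior and pull the arm with the largest draw."""
    draws = [random.betavariate(1 + s, 1 + f)  # Beta(1, 1) prior assumed
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda a: draws[a])

# Toy simulation with three arms and hypothetical success probabilities.
# eps_greedy and ucb1 would track per-arm counts and empirical means instead.
probs = [0.3, 0.5, 0.7]
succ, fail = [0] * 3, [0] * 3
for t in range(1, 2001):
    a = thompson(succ, fail)
    reward = 1 if random.random() < probs[a] else 0
    succ[a] += reward
    fail[a] += 1 - reward
print("pulls per arm:", [s + f for s, f in zip(succ, fail)])
```

Under these assumptions, the Thompson sampler should concentrate its pulls on the 0.7 arm as its posterior sharpens while still occasionally sampling the others, which is precisely the exploration-exploitation behavior the abstract describes.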