A Comparative Analysis of Cumulative Regret Based on Multi-Armed Bandit Algorithms

Authors

  • Muqing Xue

DOI:

https://doi.org/10.54097/schj0v87

Keywords:

Multi-Armed Bandit, Cumulative Regret, Thompson Sampling, ETC, UCB.

Abstract

This study presents a detailed comparison of the performance of three classic Multi-Armed Bandit (MAB) algorithms: Thompson Sampling, UCB, and Explore-Then-Commit (ETC). The MAB problem is an important sequential decision-making framework whose central challenge lies in striking a balance between "exploration" and "exploitation". We quantitatively analyzed each algorithm's long-term performance over 100 independent experiments, using cumulative regret as the primary metric. The results show that the three algorithms differ significantly in the tested setting. Thompson Sampling performed best, with the slowest regret growth and the lowest final regret. The UCB algorithm came second, with regret growing logarithmically. The ETC algorithm accumulated regret rapidly in its early exploration phase before stabilizing, but performed worst overall because it lacks the ability to explore continuously. These findings confirm that Thompson Sampling balances exploration and exploitation most efficiently and is the best choice for such stochastic stationary Multi-Armed Bandit problems.
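The paper's exact experimental setup (arm distributions, horizon, hyperparameters) is not reproduced here, but the comparison it describes can be sketched in a few lines. The following is a minimal illustrative simulation, assuming Bernoulli arms with arbitrary means, a UCB1-style confidence bonus, and an ETC exploration budget of 50 pulls per arm; none of these values are taken from the paper.

```python
import numpy as np

def run_bandit(policy, means, horizon, rng):
    """Simulate one Bernoulli-bandit run; return cumulative (pseudo-)regret."""
    k = len(means)
    counts = np.zeros(k)   # number of pulls per arm
    rewards = np.zeros(k)  # summed rewards per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if policy == "ts":
            # Thompson Sampling: draw from each arm's Beta(1+S, 1+F) posterior
            samples = rng.beta(rewards + 1, counts - rewards + 1)
            arm = int(np.argmax(samples))
        elif policy == "ucb":
            # UCB1: play each arm once, then add an optimism bonus
            if t <= k:
                arm = t - 1
            else:
                ucb = rewards / counts + np.sqrt(2 * np.log(t) / counts)
                arm = int(np.argmax(ucb))
        else:  # "etc": round-robin exploration, then commit to the best mean
            m = 50  # exploration rounds per arm (illustrative choice)
            arm = (t - 1) % k if t <= m * k else int(np.argmax(rewards / counts))
        r = float(rng.random() < means[arm])  # Bernoulli reward
        counts[arm] += 1
        rewards[arm] += r
        regret += best - means[arm]
    return regret

rng = np.random.default_rng(0)
means = [0.5, 0.6, 0.7]  # illustrative arm means, not the paper's configuration
for policy in ("ts", "ucb", "etc"):
    runs = [run_bandit(policy, means, 2000, rng) for _ in range(20)]
    print(policy, round(float(np.mean(runs)), 1))
```

Averaging many such runs and plotting regret over time reproduces the qualitative pattern the abstract reports: ETC's regret climbs steeply during its fixed exploration phase, UCB's grows logarithmically, and Thompson Sampling's grows slowest.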



Published

29-01-2026

Issue

Section

Articles

How to Cite

Xue, M. (2026). A Comparative Analysis of Cumulative Regret Based on Multi-Armed Bandit Algorithms. Academic Journal of Science and Technology, 19(2), 151-157. https://doi.org/10.54097/schj0v87