Balancing Exploration and Exploitation: Insights from ETC, UCB, and Thompson Sampling

Authors

  • Xinyu Cao

DOI:

https://doi.org/10.54097/dknft795

Keywords:

Exploration-exploitation, Algorithm environment, Practical application.

Abstract

The multi-armed bandit problem is a core problem in online decision-making, with wide applications in areas such as advertising recommendation, clinical trials, and financial portfolio optimization. This paper examines three classic algorithms for the multi-armed bandit problem: Explore-Then-Commit (ETC), Upper Confidence Bound (UCB), and Thompson Sampling (TS). First, the basic principles and operating mechanism of each algorithm are introduced, and the corresponding cumulative regret is analyzed. By simulating each algorithm in different environments, the stability of these algorithms is verified, further validating their respective strengths in balancing exploration and exploitation. The results show that the ETC algorithm is simple and easy to implement, the UCB algorithm has strong theoretical guarantees and requires no prior information, and the TS algorithm explores more efficiently, making it suitable for dynamic environments. This article provides theoretical support and practical examples for understanding and applying multi-armed bandit algorithms.


References

[1] Robbins, Herbert. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952, 58(5): 527–535.

[2] Shukla, Avi L. Multi-armed bandit study of exploration vs. exploitation. 2022.

[3] Auer, Peter; Cesa-Bianchi, Nicolò; Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002, 47(2–3): 235–256.

[4] Gonçalves, Richard A.; Almeida, Carolina P.; Pozo, Aurora. Upper confidence bound (UCB) algorithms for adaptive operator selection in MOEA/D. In: International Conference on Evolutionary Multi-Criterion Optimization. Cham: Springer International Publishing, 2015.

[5] Nguyen-Thanh, Nhan, et al. Recommendation system-based upper confidence bound for online advertising. arXiv preprint arXiv:1909.04190, 2019.

[6] Jourdan, Marc; Degenne, Rémy. Non-asymptotic analysis of a UCB-based top two algorithm. Advances in Neural Information Processing Systems, 2023, 36: 68980–69020.

[7] Garivier, Aurélien; Cappé, Olivier. The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Proceedings of the 24th Annual Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2011.

[8] Cappé, Olivier; Garivier, Aurélien; Maillard, Olivier; Munos, Rémi; Stoltz, Gilles. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 2013, 41(3): 1516–1541.

[9] Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933, 25(3/4): 285–294.

[10] Agrawal, Shipra; Goyal, Navin. Analysis of Thompson Sampling for the multi-armed bandit problem. In: Proceedings of the 25th Annual Conference on Learning Theory (COLT), JMLR Proceedings, 2012, 23: 39.1–39.26.

[11] Agrawal, Shipra; Goyal, Navin. Thompson Sampling for contextual bandits with linear payoffs. In: International Conference on Machine Learning (ICML), 2013.

[12] Nasiri, Mohammad Mahdi; Kianfar, Farhad. A GES/TS algorithm for the job shop scheduling. Computers & Industrial Engineering, 2012, 62(4): 946–952.

Published

29-01-2026

Issue

Section

Articles

How to Cite

Cao, X. (2026). Balancing Exploration and Exploitation: Insights from ETC, UCB, and Thompson Sampling. Academic Journal of Science and Technology, 19(2), 19–22. https://doi.org/10.54097/dknft795