Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

Mengfan Xu; Diego Klabjan

28 Sept 2020 (modified: 22 Jun 2025), ICLR 2021 Conference Blind Submission, Readers: Everyone

Abstract: EXP-based algorithms are often used for exploration in multi-armed bandit problems. We revisit the EXP3.P algorithm and establish both lower and upper regret bounds in the Gaussian multi-armed bandit setting, as well as in a more general distributional setting. Unlike classical regret analyses, these analyses do not require bounded rewards. We also extend EXP4 from multi-armed bandits to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games, where it shows improved exploration compared to the state of the art.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/regret-bounds-and-reinforcement-learning/code)

Reviewed Version (pdf): https://openreview.net/references/pdf?id=vM6p313pXM
