Keywords: Matrix Games, Bandit Learning, Evolutionary Algorithms, Regret Analysis
Abstract: Learning in games is a fundamental problem in machine learning and artificial intelligence, with many successful applications (Silver et al., 2016; Schrittwieser et al., 2020). We consider the problem of learning in matrix games, where two players engage in a two-player zero-sum game with an unknown payoff matrix and bandit feedback. In this setting, at each round each player observes only its own action and the corresponding (noisy) payoff. This problem has been studied in the literature, and several algorithms have been proposed to address it (O’Donoghue et al., 2021; Maiti et al., 2023; Cai et al., 2023). In particular, O’Donoghue et al. (2021) demonstrated that deterministic optimism (e.g., the UCB algorithm for matrix games) plays a central role in achieving sublinear regret and outperforms other algorithms. However, despite numerous applications, the theoretical understanding of learning in matrix games remains limited. Specifically, it remains an open question whether algorithms based on randomised optimism can also achieve sublinear regret.
In this paper, we propose a novel algorithm called Competitive Co-evolutionary Bandit Learning (CoEBL) for unknown two-player zero-sum matrix games. By integrating evolutionary algorithms (EAs) into the bandit framework, CoEBL introduces randomised optimism through the variation operator of EAs. We prove that CoEBL enjoys sublinear regret, matching the regret guarantees of algorithms based on deterministic optimism (O’Donoghue et al., 2021). To the best of our knowledge, this is the first regret analysis of an evolutionary bandit learning algorithm in matrix games. Empirically, we compare CoEBL with classical bandit algorithms, including EXP3 (Auer et al., 2002), an EXP3-IX variant (Cai et al., 2023), and the UCB algorithms analysed by O’Donoghue et al. (2021), across several matrix game benchmarks. Our results show that CoEBL not only enjoys sublinear regret but also outperforms existing methods in various scenarios. These findings reveal the promising potential of evolutionary bandit learning in game-theoretic settings, in particular the effectiveness of randomised optimism via evolutionary algorithms.
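To make the setting concrete, the following is a minimal illustrative sketch (not the paper's CoEBL algorithm) of bandit feedback in a two-player zero-sum matrix game, where the row player adds a random optimism bonus to its empirical payoff estimates instead of a deterministic UCB bonus. The payoff matrix, noise model, perturbation scheme, and the uniformly random column player are assumptions made purely for illustration.

```python
# Illustrative sketch only: randomised optimism in a zero-sum matrix game with
# bandit feedback. This is NOT the authors' CoEBL algorithm; the matrix A, the
# Gaussian noise, and the perturbation scale are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.8, 0.2],      # hypothetical payoff matrix (row player's gains)
              [0.3, 0.7]])
n_rows, n_cols = A.shape

sums = np.zeros((n_rows, n_cols))    # running sums of observed payoffs
counts = np.ones((n_rows, n_cols))   # visit counts (start at 1 to avoid /0)

for t in range(1, 1001):
    means = sums / counts
    # Randomised optimism: a Gaussian perturbation that shrinks with visit
    # counts, used in place of a deterministic UCB exploration bonus.
    optimistic = means + rng.normal(scale=1.0 / np.sqrt(counts))
    i = int(np.argmax(optimistic.min(axis=1)))   # row player's maximin choice
    j = int(rng.integers(n_cols))                # column player plays uniformly here
    payoff = A[i, j] + rng.normal(scale=0.1)     # noisy bandit feedback
    sums[i, j] += payoff
    counts[i, j] += 1

print("empirical payoff estimates:\n", sums / counts)
```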
Supplementary Material: pdf
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7831