The Harder Path: Last Iterate Convergence for Uncoupled Learning in Zero-Sum Games with Bandit Feedback

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY 4.0
TL;DR: We show that learning how to play a game is sometimes harder when using uncoupled algorithms.
Abstract: We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between the players, convergence of the last iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to performance: the best attainable rate is $\mathcal{O}(T^{-1/4})$, in contrast to the usual $\mathcal{O}(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate. The first leverages a straightforward tradeoff between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
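For intuition only, the sketch below shows the kind of uncoupled bandit dynamics the abstract refers to: each player runs its own exponential-weights (entropic mirror descent) update mixed with forced uniform exploration, observes only its own realized payoff, and progress is measured by the exploitability gap $\max_{x'} x'^\top A y - \min_{y'} x^\top A y'$ of the current profile $(x, y)$, which is zero exactly at a Nash equilibrium. The payoff matrix, the schedules $\gamma_t = t^{-1/4}$ and $\eta_t = t^{-1/2}$, and all helper names are illustrative assumptions, not the paper's actual algorithms or their tuning.

import numpy as np

def softmax(z):
    # Numerically stable softmax.
    w = np.exp(z - z.max())
    return w / w.sum()

def exploitability(x, y, A):
    # Exploitability gap of the profile (x, y): best-response payoff against y
    # minus best-response payoff against x; zero exactly at a Nash equilibrium.
    return np.max(A @ y) - np.min(x @ A)

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(3, 3))   # row player receives A[i, j], column player receives -A[i, j]
T = 10_000

x_logits = np.zeros(A.shape[0])   # row player's exponential-weights state
y_logits = np.zeros(A.shape[1])   # column player's state; the two players never communicate

for t in range(1, T + 1):
    gamma = t ** -0.25            # placeholder forced-exploration rate
    eta = t ** -0.5               # placeholder step size

    # Each policy mixes the exponential-weights distribution with uniform exploration.
    x = (1 - gamma) * softmax(x_logits) + gamma / A.shape[0]
    y = (1 - gamma) * softmax(y_logits) + gamma / A.shape[1]

    # Each player samples one action and observes only its own realized payoff (bandit feedback).
    i = rng.choice(A.shape[0], p=x)
    j = rng.choice(A.shape[1], p=y)

    # Importance-weighted payoff estimates: one non-zero coordinate per player.
    g_x = np.zeros(A.shape[0])
    g_x[i] = A[i, j] / x[i]
    g_y = np.zeros(A.shape[1])
    g_y[j] = -A[i, j] / y[j]

    # Entropic mirror-descent (exponential-weights) update on each player's own estimate.
    x_logits += eta * g_x
    y_logits += eta * g_y

print("exploitability of the last iterate:", exploitability(x, y, A))

In this sketch, the exploration rate gamma trades off how fast the policies concentrate against how accurate the importance-weighted estimates remain, which is the kind of exploration-exploitation tradeoff the first algorithm in the abstract is described as leveraging.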
Lay Summary: We study the design of efficient algorithms that learn optimal strategies for playing games. We focus on a setting where two independent algorithms learn by repeatedly playing against each other, without any communication. Most existing works assume that the algorithms can freely play against each other and only need to output a good final strategy. In this article, we add a constraint: the strategies must improve over time. In particular, the algorithms are not allowed to submit a potentially "dumb" strategy at some point just as a test. We show that this makes learning harder and propose two methods that are mathematically near-optimal. This research applies to learning actual games such as poker, but also more broadly, since many important machine learning problems can be formulated this way. For example, when two language models each propose a response, the winner can be the one chosen by the user.
Primary Area: General Machine Learning->Online Learning, Active Learning and Bandits
Keywords: Game theory, Bandit, Last-iterate convergence, Online learning
Submission Number: 15967