Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: The first learning algorithm for IIEFGs with last-iterate convergence guarantees in the bandit-feedback setting.
Abstract: We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect-information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit-feedback setting, where in each episode players observe only the rewards of the information set-action pairs they actually experience. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space to enable an efficient approximate policy update. Through a carefully designed virtual transition combined with the entropy regularization technique, we establish finite-time last-iterate convergence to the NE at a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ in the number of episodes $k$, using only bandit feedback in each episode. Empirical evaluations across various IIEFG instances show that our algorithm is competitive with baseline methods.
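To give a sense of the kind of update the abstract describes, the following is a schematic sketch of an online mirror descent step with a weighted negentropy regularizer plus entropy regularization; it is not the paper's exact algorithm, and the virtual-transition weights $p^{\ast}(x)$, loss estimator $\widehat{\ell}_k$, and parameters $\eta, \tau$ are illustrative assumptions rather than quantities taken from the paper:
$$\psi(\mu) \;=\; \sum_{(x,a)} p^{\ast}(x)\, \mu(x,a) \log \mu(x,a), \qquad \mu_{k+1} \;=\; \operatorname*{arg\,min}_{\mu \in \Pi} \; \big\langle \widehat{\ell}_k + \tau \nabla \psi(\mu_k),\, \mu \big\rangle \;+\; \tfrac{1}{\eta}\, D_{\psi}(\mu, \mu_k),$$
where $\mu$ ranges over the policy set $\Pi$ defined on the information set-action space, $D_{\psi}$ is the Bregman divergence induced by $\psi$, and the $\tau$-term is the entropy-regularization component that typically drives last-iterate (rather than average-iterate) convergence in schemes of this type.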
Lay Summary: In competitive games like poker, players must make decisions without knowing their opponent’s moves—a scenario called imperfect information. Finding optimal strategies (Nash equilibria) in such games is challenging, especially when players only learn from their own gameplay (bandit feedback), rather than observing the full game dynamics. Our work introduces a new algorithm that helps players efficiently improve their strategies over time, ensuring they converge to a near-optimal solution with guaranteed performance. Unlike prior methods, which either require full knowledge of the game or lack convergence guarantees, our approach uses a carefully designed regularization technique to balance exploration and exploitation. We prove that our method converges reliably, even with limited feedback, and demonstrate its effectiveness across various game scenarios. This could have applications in AI for games, strategic decision-making, and even real-world negotiations where information is incomplete. In short: we propose a practical way for AI (or humans) to learn strong strategies in competitive, hidden-information settings—with theoretical guarantees and strong empirical results.
Primary Area: Reinforcement Learning->Online
Keywords: Extensive-form games; partially observable Markov games (POMGs); last-iterate convergence
Submission Number: 4292