## Learning in two-player zero-sum partially observable Markov games with perfect recall

21 May 2021, 20:52 (modified: 26 Oct 2021, 19:21)NeurIPS 2021 PosterReaders: Everyone
Keywords: online learning, reinforcement learning, Nash equilibrium, Markov games
TL;DR: Self-play online learning for two-player zero-sum, tabular, episodic, partially observable Markov games
Abstract: We study the problem of learning a Nash equilibrium (NE) in an extensive game with imperfect information (EGII) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular EGII under the \textit{perfect-recall} assumption where the only feedback is realizations of the game (bandit feedback). In particular the \textit{dynamics of the EGII is not known}---we can only access it by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on convergence rate to the NE of order $1/\sqrt{T}$ where~$T$ is the number of played games. Moreover IXOMD is computationally efficient as it needs to perform the updates only along the sampled trajectory.
Supplementary Material: pdf
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
14 Replies