Towards Provably Efficient Learning of Extensive-Form Games with Imperfect Information and Linear Function Approximation

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Extensive-Form Games, Partially Observable Markov Games
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present the first line of algorithms for provably efficient learning of extensive-form games with imperfect information and linear function approximation.
Abstract: We study two-player zero-sum imperfect information extensive-form games (IIEFGs) with linear function approximation. In particular, we consider linear IIEFGs in the formulation of partially observable Markov games (POMGs) with known transition and bandit feedback, in which the reward function admits a linear structure. To tackle the partial observability of this problem, we propose a linear loss estimator based on the \textit{composite} features of information set-action pairs. By integrating this loss estimator with the online mirror descent (OMD) framework and a delicate analysis of the stability term in the linear case, we prove an $\widetilde{\mathcal{O}}(\sqrt{HX^2d\alpha^{-1}T})$ regret upper bound for our algorithm, where $H$ is the horizon length, $X$ is the cardinality of the information set space, $d$ is the ambient dimension of the feature mapping, and $\alpha$ is a quantity associated with an exploration policy. Additionally, by leveraging the ``transitions'' over information set-actions, we propose another algorithm based on the follow-the-regularized-leader (FTRL) framework, attaining a regret bound of $\widetilde{\mathcal{O}}(\sqrt{H^2d\lambda T})$, where $\lambda$ is a quantity that depends on the game tree structure. Moreover, we prove that our FTRL-based algorithm also achieves an $\widetilde{\mathcal{O}}(\sqrt{HXdT})$ regret bound with a different initialization of parameters. To the best of our knowledge, we present the first line of algorithms for provably efficient learning of IIEFGs with linear function approximation.
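To make the estimator-plus-OMD pipeline described in the abstract concrete, the following is a minimal single-information-set sketch in Python of online mirror descent with a least-squares-style linear loss estimator under bandit feedback. The feature map, regularization constant, learning rate, and simulated loss parameter are all illustrative assumptions; this is a generic instance of the technique, not the paper's algorithm over the full game tree.

```python
# Minimal sketch: entropy-regularized OMD with a linear loss estimator under
# bandit feedback at a single information set. All constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)

d = 5      # ambient feature dimension (assumed)
A = 4      # number of actions at the information set (assumed)
eta = 0.1  # OMD learning rate (assumed)
T = 1000   # number of rounds

# Assumed feature map: one feature vector per information set-action pair.
Phi = rng.normal(size=(A, d))

# Hidden linear loss parameter, used only to simulate bandit feedback.
theta_star = rng.normal(size=d)

policy = np.ones(A) / A  # start from the uniform policy

for t in range(T):
    # Sample an action; only its (noisy) loss is observed.
    a = rng.choice(A, p=policy)
    loss = Phi[a] @ theta_star + 0.1 * rng.normal()

    # Linear loss estimator: regress the single observed loss onto features,
    # weighted by the policy-induced feature covariance (regularized).
    Sigma = Phi.T @ np.diag(policy) @ Phi + 1e-3 * np.eye(d)
    theta_hat = np.linalg.solve(Sigma, Phi[a] * loss)

    # Estimated per-action losses from the linear model.
    ell_hat = Phi @ theta_hat

    # Entropy-regularized OMD (exponential weights) update.
    policy = policy * np.exp(-eta * ell_hat)
    policy /= policy.sum()

print("final policy:", np.round(policy, 3))
```

The full algorithms in the paper additionally handle the tree-structured sequence-form policies, the composite features of information set-action pairs, and the exploration mixture governing $\alpha$; the sketch only illustrates the per-round estimate-then-mirror-descent step.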
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5362