Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games

Shicong Cen; Yuejie Chi; Simon Shaolei Du; Lin Xiao

Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games

Shicong Cen, Yuejie Chi, Simon Shaolei Du, Lin Xiao

Published: 01 Feb 2023, Last Modified: 02 Mar 2023ICLR 2023 posterReaders: Everyone

Keywords: zero-sum Markov game, entropy regularization, policy optimization, global convergence, multiplicative updates

TL;DR: We achieve better last-iterate convergence result of policy optimization for two-player zero-sum Markov games, with single-loop and symmetric update rules.

Abstract: Multi-Agent Reinforcement Learning (MARL)---where multiple agents learn to interact in a shared dynamic environment---permeates across a wide range of critical applications. While there has been substantial progress on understanding the global convergence of policy optimization methods in single-agent RL, designing and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges and new desiderata, which unfortunately, remain highly inadequately addressed by existing theory. In this paper, we focus on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and study equilibrium finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves a finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to a sublinear convergence to the Nash equilibrium by controlling the amount of regularization. Our convergence results improve upon the best known iteration complexities, and lead to a better understanding of policy optimization in competitive Markov games.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Theory (eg, control theory, learning theory, algorithmic game theory)

Supplementary Material: zip

12 Replies

Loading