Solving Zero-Sum Convex Markov Games

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Efficient Policy Gradient Methods to compute Nash Equilibria in Zero-sum Convex Markov Games
Abstract: We contribute the first provable guarantees of global convergence to Nash equilibria (NE) in two-player zero-sum convex Markov games (cMGs) using independent policy gradient methods. Convex Markov games, recently defined by Gemp et al. (2024), extend Markov decision processes to multi-agent settings with preferences that are convex over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min-max case of cMGs presents significant challenges, including inherent nonconvexity, the absence of Bellman consistency, and the complexity of the infinite-horizon setting. Our results follow a two-step approach. First, leveraging properties of hidden-convex–hidden-concave functions, we show that a simple nonconvex regularization transforms the min-max optimization problem into a nonconvex–proximal Polyak-Łojasiewicz (NC-pPL) objective. Crucially, this regularization stabilizes the iterates of independent policy gradient methods and ultimately leads them to converge to equilibria. Second, building on this reduction, we address general constrained min-max problems under NC-pPL and two-sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent-ascent methods, which we believe may be of independent interest.
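As a hedged schematic of the first step (the function $f$, regularizer $r$, and weight $\tau$ below are illustrative placeholders, not the paper's exact construction), the regularized problem can be pictured as

$$\min_{x \in \mathcal{X}} \; \max_{y \in \mathcal{Y}} \; F_\tau(x, y) \;=\; f(x, y) + \tau\, r(x) - \tau\, r(y),$$

where $f$ is convex–concave in the players' occupancy measures (hidden convexity–concavity) yet nonconvex–nonconcave in the policy parameters $x$ and $y$, and $r$ is chosen so that $F_\tau$ satisfies a proximal Polyak-Łojasiewicz condition in one variable while remaining nonconvex in the other, i.e., the NC-pPL structure.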
Lay Summary: Convex Markov games model a range of applications, spanning multi-agent robotic exploration, improving creativity in machine chess play, language model alignment, and more. We propose the first algorithm to solve two-player zero-sum games of this kind with a simple twist on a conventional RL algorithm, namely policy gradient. Convergence is guaranteed by regularization and by alternating gradient updates for the minimizing and maximizing variables, as in the sketch below.
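A minimal sketch of such an alternating update scheme, assuming a generic smooth objective with quadratic regularizers; the oracle names, step sizes, and regularization weight `tau` are hypothetical placeholders, and the paper's actual algorithm may differ:

```python
import numpy as np

def alternating_gda(grad_x, grad_y, x0, y0, tau=0.1,
                    eta_x=1e-2, eta_y=1e-1, n_iters=5000):
    """Alternating GDA on F(x, y) = f(x, y) + (tau/2)||x||^2 - (tau/2)||y||^2.

    grad_x(x, y) and grad_y(x, y) are (possibly stochastic) gradient
    oracles for f; the quadratic regularizer stands in for the paper's
    (unspecified here) regularization that yields the NC-pPL structure.
    """
    x, y = x0.copy(), y0.copy()
    for _ in range(n_iters):
        # Descent step for the minimizing player.
        x = x - eta_x * (grad_x(x, y) + tau * x)
        # Ascent step for the maximizing player, using the *updated* x:
        # this ordering is what makes the scheme alternating rather
        # than simultaneous.
        y = y + eta_y * (grad_y(x, y) - tau * y)
    return x, y

# Toy usage: the bilinear game f(x, y) = x^T A y, whose regularized
# saddle point is (0, 0); plain unregularized simultaneous GDA cycles here.
A = np.array([[1.0, -0.5], [0.5, 1.0]])
x, y = alternating_gda(lambda x, y: A @ y,    # gradient of f w.r.t. x
                       lambda x, y: A.T @ x,  # gradient of f w.r.t. y
                       x0=np.ones(2), y0=np.ones(2))
print(x, y)  # both should be near the origin
```

The asymmetry of the alternating steps, together with the regularization, is what damps the cycling that simultaneous gradient descent-ascent exhibits on rotational (bilinear) couplings.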
Primary Area: Theory->Game Theory
Keywords: Markov games, convex RL, regularization, Polyak-Łojasiewicz condition, proximal Polyak-Łojasiewicz, nonconvex min-max optimization, zero-sum games, gradient domination, hidden convexity, composite optimization, min-max optimization, policy gradient methods, alternating gradient descent-ascent, gradient descent-ascent, two-timescale gradient descent-ascent, nested gradient iterations
Flagged For Ethics Review: true
Submission Number: 12484