Convex Markov Games: A New Frontier for Multi-Agent Reinforcement Learning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY 4.0
TL;DR: convex MDPs + multi-agent = convex Markov game. Non-trivial Nash existence proof. Powerful notions of imitation, exploration, and more, enabled by the occupancy-measure view.
Abstract: Behavioral diversity, expert imitation, fairness, safety, and other goals give rise to preferences in sequential decision-making domains that do not decompose additively across time. We introduce the class of convex Markov games, which allow general convex preferences over occupancy measures. Despite the infinite time horizon and strictly greater generality than Markov games, pure-strategy Nash equilibria exist. Furthermore, equilibria can be approximated empirically by performing gradient descent on an upper bound of exploitability. Our experiments reveal novel solutions to classic repeated normal-form games, find fair solutions in a repeated asymmetric coordination game, and prioritize safe long-term behavior in a robot warehouse environment. In the prisoner's dilemma, our algorithm leverages transient imitation to find a policy profile that deviates only slightly from observed human play, yet achieves higher per-player utility while also being three orders of magnitude less exploitable.
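To make the "gradient descent on an upper bound of exploitability" idea concrete, below is a minimal toy sketch, not the authors' code or their exact bound. It treats a one-shot prisoner's dilemma, replaces the max over deviations with a log-sum-exp relaxation (which upper-bounds the max, hence the exploitability), and descends that loss with finite-difference gradients. The payoff matrices, the temperature tau, the learning rate, and all helper names are illustrative assumptions.

```python
import numpy as np

A = np.array([[-1., -3.], [0., -2.]])  # row player's payoffs (Cooperate, Defect); toy values
B = A.T                                # column player's payoffs (symmetric game)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def exploitability_upper_bound(theta, tau=0.1):
    """Sum over players of (soft best-response value - current value).

    tau * logsumexp(./tau) upper-bounds the max over pure deviations,
    so this loss upper-bounds exploitability (up to temperature slack).
    """
    x, y = softmax(theta[0]), softmax(theta[1])
    vx, vy = x @ A @ y, x @ B @ y                        # current expected payoffs
    br_x = tau * np.log(np.sum(np.exp((A @ y) / tau)))   # soft best response, row player
    br_y = tau * np.log(np.sum(np.exp((x @ B) / tau)))   # soft best response, column player
    return (br_x - vx) + (br_y - vy)

theta = np.zeros((2, 2))   # strategy logits for both players
lr, eps = 0.5, 1e-4
for _ in range(2000):
    base = exploitability_upper_bound(theta)
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):                 # finite differences: fine for 4 params
        t = theta.copy()
        t[idx] += eps
        grad[idx] = (exploitability_upper_bound(t) - base) / eps
    theta -= lr * grad

print(softmax(theta[0]), softmax(theta[1]))  # should drift toward mutual defection
```

In the paper's setting the utilities are general convex functions of occupancy measures rather than bilinear payoffs, so the actual bound and optimization are more involved; this sketch only illustrates the descend-an-exploitability-bound recipe on the simplest possible game.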
Lay Summary: In a traditional reinforcement learning problem, or Markov decision process (MDP), a single agent seeks a policy that maximizes its discounted sum of rewards (also called its return). Prior work has extended this problem to the multi-agent setting, called a Markov game, where every agent is *simultaneously* maximizing its own return (i.e., the agents are at a Nash equilibrium). Research on single-agent MDPs has since moved beyond utilities that encode the traditional discounted sum of rewards; one example is the entropy of an agent's state visitation, used to encourage learning a good exploration policy. This more general MDP is sometimes called a convex MDP.

What can we say about a similar multi-agent extension of MDPs with generalized utilities? Do Nash equilibria even exist? What kinds of domains can we model with such a framework, and can it help us uncover interesting multi-agent behavior? To that end, we formulate the convex Markov game. Certain properties of a convex Markov game break the assumptions previously leveraged to prove the existence of Nash equilibria, so we appeal to more general techniques to prove that Nash equilibria indeed exist. In addition, measuring how "close" a multi-agent system is to a Nash equilibrium is more expensive than in a vanilla Markov game, so we propose a cheap alternative that upper bounds this distance to equilibrium. To solve for Nash equilibria, we simply employ this upper bound as a loss function and descend it with gradient descent.

The convex Markov game framework has many applications, including helping multi-agent systems better explore their environment, more closely imitate desired target behavior, avoid "unsafe" regions of the environment, and more.
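The entropy-of-visitation example above is easy to make concrete. The following is a minimal sketch, assuming a randomly generated toy MDP: it computes a policy's discounted state-occupancy measure in closed form and scores it with the visitation entropy, a utility that is a function of the occupancy measure rather than an additive sum of per-step rewards. The transition tensor P, discount gamma, and helper names are my own, not from the paper.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# P[s, a, s'] : random transition probabilities; mu0 : initial state distribution
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
mu0 = np.ones(n_states) / n_states

def occupancy(policy):
    """Discounted state occupancy d = (1 - gamma) * (I - gamma * P_pi^T)^-1 mu0."""
    P_pi = np.einsum("sa,sat->st", policy, P)  # state-to-state kernel under the policy
    return (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu0)

def entropy_utility(policy):
    """Concave utility over the occupancy measure: entropy of state visitation."""
    d = occupancy(policy)
    return -np.sum(d * np.log(d + 1e-12))

uniform = np.ones((n_states, n_actions)) / n_actions
print(occupancy(uniform), entropy_utility(uniform))
```

Because the utility depends on the whole occupancy measure, it cannot be written as an expected discounted sum of any fixed reward, which is exactly what pushes the analysis beyond standard Markov games.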
Primary Area: Theory->Game Theory
Keywords: Markov Decision Process, Nash Equilibrium, Markov Game, N-player General-Sum, Convex Optimization, Occupancy Measure
Submission Number: 751