Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We provide a provably efficient model-based algorithm for online Markov games that incentivizes exploration without uncertainty quantification.
Abstract: Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, where the goal is to find various notions of equilibria, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE), in a sample-efficient manner. However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration by biasing the empirical estimate of the model parameters towards those with higher collective best-response values of all the players when fixing the other players' policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to the form of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we establish that VMG achieves near-optimal regret for finding both the NEs of two-player zero-sum Markov games and the CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, nearly matching its counterparts that rely on sophisticated uncertainty quantification.
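A minimal sketch of the incentivized model-estimation step described above, assuming a regularized fitting objective; the symbols $\mathcal{L}_t$, $\eta$, $N$, and $V_{1,n}$ are illustrative placeholders rather than the paper's exact notation:

\[
\hat{\theta}_t \;\in\; \arg\min_{\theta} \;\; \mathcal{L}_t(\theta) \;-\; \eta \sum_{n=1}^{N} \max_{\pi_n'} V_{1,n}^{\pi_n' \times \pi_{-n}^t}(\theta),
\]

where $\mathcal{L}_t(\theta)$ is the empirical fitting loss of the model on the data collected so far, and the second term biases the estimate towards models under which each player's best response against the others' current policies $\pi_{-n}^t$ attains a higher value. This value-incentivized bias plays the role of an exploration bonus without requiring explicit uncertainty quantification.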
Lay Summary: Multi-agent reinforcement learning (MARL) in unknown environments presents a significant challenge in designing efficient exploration strategies. While common approaches incentivize exploration through explicit bonus terms derived from uncertainty quantification, constructing these bonuses often becomes intractable, particularly with complex function approximation or numerous agents. This paper introduces VMG (Value-incentivized Markov Game solver), a novel model-based algorithm that promotes exploration by biasing the empirical estimate of the game's model parameters. Specifically, VMG favors model parameters associated with higher collective best-response values for all players, thereby encouraging policy deviation from current equilibria without explicit bonus functions. This methodology allows for simultaneous, uncoupled policy updates and is theoretically established to achieve near-optimal regret for finding Nash and Coarse Correlated Equilibria in Markov games, comparable to methods relying on sophisticated uncertainty quantification.
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: MARL theory
Submission Number: 7828