Constrained Exploitability Descent: An Offline Reinforcement Learning Method for Finding Mixed-Strategy Nash Equilibrium
Abstract: This paper proposes Constrained Exploitability Descent (CED), a model-free offline reinforcement learning (RL) algorithm for solving adversarial Markov games (MGs). CED combines the game-theoretic approach of Exploitability Descent (ED) with policy-constraint methods from offline RL. While policy constraints can perturb the optimal pure-strategy solutions in single-agent scenarios, we find this side effect less detrimental in adversarial games, where the optimal policy can be a mixed-strategy Nash equilibrium. We theoretically prove that, under a uniform-coverage assumption on the dataset, CED converges to a stationary point in deterministic two-player zero-sum Markov games. We further prove that the min-player policy at the stationary point satisfies the property of a mixed-strategy Nash equilibrium in MGs. Unlike the model-based ED method, which optimizes the max-player policy, CED no longer relies on a generalized gradient. Experiments in matrix games, a tree-form game, and an infinite-horizon soccer game verify that CED finds an equilibrium policy for the min-player as long as the offline dataset guarantees uniform coverage. Moreover, CED achieves a significantly lower NashConv than an existing pessimism-based method and can gradually improve upon the behavior policy even under non-uniform data coverage. When combined with neural networks, CED also outperforms behavior cloning and offline self-play in a large-scale two-team robotic combat game.
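To make the high-level idea concrete, the sketch below illustrates one plausible reading of the abstract in the simplest setting: exploitability descent on the min-player's mixed strategy in a zero-sum matrix game, augmented with a policy constraint (here a KL penalty) toward a behavior policy estimated from an offline dataset. This is not the paper's exact algorithm; the payoff estimate, the KL form of the constraint, and all names and hyperparameters (payoff_hat, behavior_x, beta, lr) are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' implementation): projected gradient
# descent on the min-player's loss against a best-responding opponent, with a
# KL penalty that constrains the policy toward the offline behavior policy.
import numpy as np

def simplex_project(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def ced_matrix_game_sketch(payoff_hat, behavior_x, steps=2000, lr=0.05, beta=0.1):
    """payoff_hat[i, j]: estimated loss of the min-player for action pair (i, j),
    e.g. empirical means from the offline dataset (hypothetical choice).
    behavior_x: min-player behavior policy estimated from the same dataset.
    beta: strength of the policy constraint (KL toward behavior_x)."""
    x = behavior_x.copy()  # start from the behavior policy
    for _ in range(steps):
        # Max-player best response to the current min-player strategy.
        j_star = np.argmax(payoff_hat.T @ x)
        # Gradient of the loss against that best response, plus the gradient
        # of the KL(x || behavior_x) constraint term.
        grad = payoff_hat[:, j_star] + beta * (
            np.log(x + 1e-12) + 1.0 - np.log(behavior_x + 1e-12))
        x = simplex_project(x - lr * grad)  # projected gradient step
    return x

# Usage on matching pennies with a uniform behavior policy (uniform coverage):
A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # min-player's loss matrix
x_hat = ced_matrix_game_sketch(A, behavior_x=np.array([0.5, 0.5]))
print(x_hat)  # stays near the mixed-strategy equilibrium (0.5, 0.5)
```

In this toy case the mixed-strategy equilibrium is itself covered by the behavior policy, so the constraint does little harm, which is the intuition the abstract points to for adversarial games.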
Lay Summary: Adversarial games are usually solved through iterative policy updates that rely on an exact game model or online sampling. However, an exact game model or a large amount of online data can be expensive to obtain in the real world, especially for serious games. We therefore propose a novel algorithm, Constrained Exploitability Descent (CED), that solves adversarial games offline. By combining techniques from offline reinforcement learning and game theory, CED can find the best strategy in an adversarial game using only a finite set of game examples. We provide both theoretical evidence and simulation results showing that CED works well and has favorable properties compared to other descent-based methods. According to our theory, the final result of CED gets close to a mixed-strategy Nash equilibrium as long as there is enough game data. We verify this assertion in matrix games, a tree-form game, and a soccer game. Even when the data coverage is theoretically insufficient, CED still gradually improves upon the policy underlying the data. When equipped with neural networks, CED is also applicable to large-scale games and clearly outperforms existing offline methods in a two-team robotic combat game.
Primary Area: Reinforcement Learning->Batch/Offline
Keywords: offline reinforcement learning, adversarial Markov game, mixed-strategy Nash equilibrium, policy constraint, exploitability descent
Submission Number: 9335