Reinforcement learning for saddle-point equilibria without full state exploration

ICLR 2026 Conference Submission 15882 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: reinforcement learning, zero-sum games, Q-learning
TL;DR: We propose an algorithm that can find and certify saddle-point policies for zero-sum turn games without fully exploring the state space, with design and analysis built upon a novel fixed-point condition on the Q-function.
Abstract: We introduce a new fixed-point condition on the state-action-value $Q$-function for zero-sum Markov turn games that suffices to construct saddle-point and security policies, yet is less restrictive than the classical condition arising from the Bellman equation. We then propose an iterative algorithm that guarantees convergence to a function satisfying this less restrictive condition. The key benefit of the new condition and algorithm is that convergence to a saddle point can (and typically will) be reached without full exploration of the state space, generally enabling the solution of larger games with less computation. Our algorithm is based on a limited form of exploration that gathers samples from repeated attempts to certify the current candidate policies as a saddle point, motivating the terminology "saddle-point exploration" (SPE). We illustrate the new condition and algorithm on several combinatorial games that can be scaled in the size of their state and action spaces. Numerical results, using both tabular and neural-network $Q$-function representations, consistently show that saddle-point policies can be formally certified without full state exploration and, for several games, that the fraction of states explored decreases as the game grows.
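The following is only a minimal sketch of the exploration pattern described in the abstract, written under explicit assumptions: the Nim-like toy game, the names `actions`, `step`, `greedy`, and `certification_attempt`, the epsilon parameter, and the "50 consecutive clean attempts" stopping rule are all illustrative inventions, not the paper's construction. In particular, the simple self-consistency check along (mostly) greedy play stands in for the paper's fixed-point condition and formal saddle-point certificate, which are not reproduced here.

```python
import random
from collections import defaultdict

# --- Tiny zero-sum turn game (Nim-like), purely illustrative ---
# State: (stones_left, player_to_move); player 0 maximizes, player 1 minimizes.
# The player who removes the last stone wins: +1 if player 0, -1 if player 1.

def actions(state):
    stones, _ = state
    return [a for a in (1, 2) if a <= stones]

def step(state, a):
    stones, player = state
    stones -= a
    if stones == 0:                       # mover took the last stone and wins
        return None, (1.0 if player == 0 else -1.0)
    return (stones, 1 - player), 0.0

Q = defaultdict(float)                    # tabular Q; unseen pairs default to 0
visited = set()

def greedy(state):
    """Candidate policy pair induced by Q: argmax for player 0, argmin for player 1."""
    _, player = state
    pick = max if player == 0 else min
    return pick(actions(state), key=lambda a: Q[(state, a)])

def certification_attempt(start, eps=0.3):
    """One attempt to certify the current greedy policies as a saddle point:
    play (mostly) greedily from the start state, flag any visited (state, action)
    whose Q-value disagrees with its one-step lookahead target, and back the
    targets up along the visited path only."""
    state, consistent = start, True
    while state is not None:
        visited.add(state)
        a = greedy(state) if random.random() > eps else random.choice(actions(state))
        nxt, r = step(state, a)
        target = r if nxt is None else Q[(nxt, greedy(nxt))]
        if abs(Q[(state, a)] - target) > 1e-9:
            consistent = False
        Q[(state, a)] = target            # update only along the visited path
        state = nxt
    return consistent

random.seed(0)
start = (7, 0)
streak, attempts = 0, 0
# Stand-in stopping rule: accept once 50 consecutive attempts pass unchanged.
while streak < 50 and attempts < 10_000:
    streak = streak + 1 if certification_attempt(start) else 0
    attempts += 1

print(f"attempts: {attempts}, value at start: {Q[(start, greedy(start))]:+.0f}")
print(f"distinct states visited: {len(visited)}")
```

The only aspect carried over from the abstract is the control flow: the candidate policies are the greedy pair induced by the current $Q$, all samples come from repeated certification attempts, and $Q$-values are updated exclusively along states visited during those attempts, so the loop can terminate without ever touching the rest of the state space.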
Primary Area: reinforcement learning
Submission Number: 15882