Simultaneously Perturbed Optimistic Gradient Methods for Payoff-Based Learning in Games

ICLR 2026 Conference Submission 17944 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Game theory, learning theory, multi-agent reinforcement learning, two-timescale stochastic approximation, stochastic approximation, monotone games, Nash equilibrium
TL;DR: This article presents a payoff-based, optimistic-gradient-style learning algorithm that converges quickly in monotone games, using a new two-timescale stochastic approximation technique to improve on the sharpest previously known convergence rates.
Abstract: We examine the long-run behavior of learning in a repeated game where the agents operate in a low-information environment, observing only their own realized payoffs at each stage. We study this problem in the context of monotone games with unconstrained action spaces, where standard optimistic gradient schemes may lead to cycles of play, even with perfect gradient information. To account for the fact that only a single payoff observation can be made at each iteration (and no gradient information is directly observable), we design and deploy a simultaneous perturbation gradient estimation method adapted to the challenges of the problem at hand, namely unbounded action spaces, gradients, and rewards. In contrast to single-timescale approaches, we find that a two-timescale approach is much more effective at controlling the (unbounded) noise introduced by payoff-based gradient estimators in this setting. Owing to the introduction of a second timescale, we show that the proposed simultaneously perturbed optimistic gradient (SPOG) algorithm converges to equilibrium with probability 1. In addition, by developing a new method for assessing the convergence rate of two-timescale stochastic approximation procedures, we show that the sequence of play induced by SPOG converges at a rate of $\tilde{\mathcal{O}}(n^{-2/3})$ in strongly monotone games. To the best of our knowledge, this is the first convergence rate result for games with unbounded action spaces, and it is faster than the sharpest known convergence rates for single-observation, payoff-based learning in strongly monotone games with bounded action spaces.
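To make the flavor of the method concrete, here is a minimal, illustrative Python sketch of the two ingredients the abstract describes: a one-point simultaneous-perturbation gradient estimate built from a single payoff observation, and an optimistic-style update run on two timescales (a fast gradient tracker and a slow action update). Everything below is an assumption made for illustration: the names (`spog_sketch`, `payoff`), the step-size and perturbation decay exponents, and the specific optimistic correction are placeholders, not the paper's exact SPOG recursion.

```python
import numpy as np

def spog_sketch(payoff, x0, n_iters=20_000, seed=0):
    """Illustrative one-player sketch of a simultaneously perturbed,
    optimistic-gradient-style update; a toy stand-in, NOT the paper's
    exact SPOG method.  `payoff` maps an action in R^d to one scalar
    reward observation (the only feedback available)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    g = np.zeros(d)        # fast-timescale gradient tracker
    g_prev = np.zeros(d)   # previous tracker value (optimistic correction)
    for n in range(1, n_iters + 1):
        gamma = 0.5 / n ** (2 / 3)   # slow action step size (illustrative decay)
        beta = 1.0 / n ** (1 / 2)    # fast tracking step (illustrative decay)
        delta = 1.0 / n ** (1 / 6)   # perturbation radius (illustrative decay)
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)               # random direction on the sphere
        r = payoff(x + delta * u)            # single payoff observation
        v_hat = (d / delta) * r * u          # one-point SPSA-style estimate
        g_prev, g = g, g + beta * (v_hat - g)  # fast averaging tames the noise
        x = x + gamma * (2.0 * g - g_prev)     # optimistic (past-gradient) step
    return x

if __name__ == "__main__":
    # Toy strongly monotone example: payoff -||a - 1||^2, maximized at (1, 1).
    # Output should land near [1, 1] for typical seeds, up to estimator noise.
    print(spog_sketch(lambda a: -np.sum((a - 1.0) ** 2), x0=np.zeros(2)))
```

In this sketch the tracker runs on the faster timescale ($\beta_n$ decays more slowly than $\gamma_n$), so the high-variance one-point estimates are averaged out before they drive the action update; the specific decay rates that yield the paper's guarantees come from its analysis, not from these illustrative choices.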
Primary Area: learning theory
Submission Number: 17944