Two-Timescale Q-Learning with Function Approximation in Zero-Sum Stochastic Games

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman E. Ozdaglar, Adam Wierman

Published: 2024, Last Modified: 28 Jan 2025, EC 2024, CC BY-SA 4.0

Abstract: We consider two-player zero-sum stochastic games and propose a two-timescale variant of Q-learning with function approximation that is payoff-based, convergent, rational, and symmetric between the two players. In two-timescale Q-learning, the fast-timescale iterates are updated in the spirit of stochastic gradient descent for minimizing a Bellman error variant, and the slow-timescale iterates (which we use to compute the policies) are updated by taking a convex combination of their previous iterate and the latest fast-timescale iterate. In the special case of linear function approximation, we present, to the best of our knowledge, the first last-iterate finite-sample bound for payoff-based independent learning dynamics of these types. To establish the results, we analyze our proposed algorithm using a two-timescale stochastic approximation framework and derive the finite-sample bound through a Lyapunov-based approach. The key technical novelty lies in the construction of a valid Lyapunov function to capture the evolution of the slow-timescale iterates. Specifically, through a change of variable, we show that the update equation of the slow-timescale iterates resembles the classical smoothed best-response dynamics, where the regularized Nash gap serves as a valid Lyapunov function. This insight enables us to construct a valid Lyapunov function via a generalized variant of the Moreau envelope of the regularized Nash gap. The construction of our Lyapunov function might be of independent interest in studying the dynamics of general stochastic approximation algorithms. The full paper is publicly available at https://arxiv.org/abs/2312.04905.
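To give the smoothed best-response dynamics mentioned above a concrete shape, the following is a minimal sketch on a single-state zero-sum matrix game. It is not the paper's algorithm: there is no Q-function, no sampling, and no fast timescale. It only illustrates the convex-combination update toward a softmax (entropy-smoothed) best response, and tracks one standard form of the entropy-regularized Nash gap, which may differ from the paper's exact definition. The payoff matrix `A` (rock-paper-scissors), the temperature `tau`, the step size `alpha`, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def softmax(v):
    # Numerically stable softmax.
    z = np.exp(v - v.max())
    return z / z.sum()


def logsumexp(v):
    # Numerically stable log-sum-exp.
    m = v.max()
    return m + np.log(np.sum(np.exp(v - m)))


def entropy(p):
    # Shannon entropy; p is assumed strictly positive here.
    return -np.sum(p * np.log(p))


def regularized_nash_gap(A, x, y, tau):
    # Entropy-regularized Nash gap for the game where the row player
    # maximizes x^T A y and the column player minimizes it.
    # max_{x'} [x'^T u + tau * H(x')] = tau * logsumexp(u / tau),
    # and the bilinear terms x^T A y cancel when the two players' gaps are summed.
    best_x = tau * logsumexp(A @ y / tau)
    best_y = tau * logsumexp(-A.T @ x / tau)
    return best_x + best_y - tau * entropy(x) - tau * entropy(y)


def smoothed_best_response(A, x, y, tau, alpha, steps):
    # Each step moves both policies toward their smoothed best responses
    # by a convex combination with weight alpha (simultaneous update).
    gaps = [regularized_nash_gap(A, x, y, tau)]
    for _ in range(steps):
        br_x = softmax(A @ y / tau)
        br_y = softmax(-A.T @ x / tau)
        x = (1 - alpha) * x + alpha * br_x
        y = (1 - alpha) * y + alpha * br_y
        gaps.append(regularized_nash_gap(A, x, y, tau))
    return x, y, gaps


# Hypothetical example: rock-paper-scissors payoffs for the row player.
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
x0 = np.array([0.8, 0.1, 0.1])
y0 = np.array([0.1, 0.8, 0.1])

x_final, y_final, gaps = smoothed_best_response(
    A, x0, y0, tau=1.0, alpha=0.05, steps=2000
)
print("initial gap:", gaps[0])
print("final gap:", gaps[-1])
print("final row policy:", x_final)
```

In this toy setting the gap shrinks toward zero as both policies approach the uniform strategy, which is the intuition behind using the regularized Nash gap as a Lyapunov function. In discrete time the decrease need not be monotone at every step, so the sketch only compares the first and last values.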