The Stability and Convergence of Two-Timescale Stochastic Approximation with Markovian Noise for Reinforcement Learning

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Stochastic Approximation, Reinforcement Learning
TL;DR: This work presents the first proofs of stability and convergence for two-timescale stochastic approximation algorithms with Markovian noise, with an application to a reinforcement learning algorithm of interest.
Abstract: Stochastic approximation (SA) algorithms, which derive their power from random, incremental updates, are at the heart of reinforcement learning (RL). Advances in SA theory have established rigorous results for the most important algorithms in RL, including stochastic gradient descent and temporal difference learning. In this work, we focus on two-timescale stochastic approximation, a class that notably includes temporal difference learning with gradient correction (TDC) and actor-critic methods. Prior work has developed stability (boundedness) and convergence criteria for two-timescale SA under i.i.d. noise, but analogous results for Markovian noise have remained elusive. This is a critical gap, since RL data are generated by a Markov chain, making i.i.d. assumptions unrealistic. To address it, we present the first stability result and the first asymptotic convergence result for two-timescale schemes with Markovian noise under general, verifiable conditions, notably without resorting to projected variants of the schemes or requiring the noise to live in a compact space. As a key application, we give the first asymptotic convergence proof for TDC, an off-policy prediction algorithm with linear function approximation and eligibility traces. Together, our results extend SA theory and establish the first theoretical foundation for analyzing two-timescale algorithms under the realistic noise models inherent to RL.
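For concreteness, below is a minimal sketch of the one-step off-policy TDC update (Sutton et al., 2009) in the two-timescale form the abstract refers to, assuming linear features. The variable names, the demo loop, and the omission of eligibility traces are our illustration, not the paper's exact scheme; the key point is the two step sizes, with the slow rate alpha = o(beta).

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One off-policy TDC update with linear function approximation.

    theta : slow-timescale weights (value estimate), shape (d,)
    w     : fast-timescale auxiliary weights,        shape (d,)
    phi, phi_next : feature vectors of the current and next states
    rho   : importance-sampling ratio pi(a|s) / mu(a|s)
    alpha, beta : step sizes on the slow/fast timescales, alpha = o(beta)
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta  # TD error
    # Slow timescale: TD update with a gradient-correction term.
    theta = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)
    # Fast timescale: track the least-squares projection of the TD error.
    w = w + beta * (rho * delta - phi @ w) * phi
    return theta, w

# Tiny synthetic demo (random features/rewards, on-policy so rho = 1).
rng = np.random.default_rng(0)
d = 4
theta, w = np.zeros(d), np.zeros(d)
for _ in range(1000):
    phi, phi_next = rng.normal(size=d), rng.normal(size=d)
    theta, w = tdc_step(theta, w, phi, phi_next,
                        reward=rng.normal(), rho=1.0,
                        gamma=0.9, alpha=0.01, beta=0.1)
```

In an RL setting, the transitions (phi, phi_next, reward) come from a Markov chain rather than i.i.d. draws, which is precisely the noise model the paper's stability and convergence results address.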
Primary Area: reinforcement learning
Submission Number: 23116