Linear $Q$-Learning Does Not Diverge in $L^2$: Convergence Rates to a Bounded Set

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: $Q$-learning is one of the most fundamental reinforcement learning algorithms. It was widely believed that $Q$-learning with linear function approximation (i.e., linear $Q$-learning) could diverge, until the recent work of Meyn (2024), which establishes the ultimate almost sure boundedness of the iterates of linear $Q$-learning. Building on this success, this paper further establishes the first $L^2$ convergence rate of the linear $Q$-learning iterates (to a bounded set). Similar to Meyn (2024), we do not modify the original linear $Q$-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an $\epsilon$-softmax behavior policy with an adaptive temperature. The key to our analysis is a general result on stochastic approximation under Markovian noise with fast-changing transition functions. As a byproduct, we also use this general result to establish the $L^2$ convergence rate of tabular $Q$-learning with an $\epsilon$-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.
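To make the setup concrete, below is a minimal Python sketch of unmodified linear $Q$-learning driven by an $\epsilon$-softmax behavior policy. The `env` interface, the feature map `phi`, the step-size schedule, and the particular adaptive-temperature rule (tied to the weight norm) are illustrative assumptions for this sketch, not the paper's exact choices.

```python
import numpy as np

def softmax(x):
    z = x - x.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def eps_softmax_policy(q_values, eps, temperature):
    """Epsilon-softmax: mix a softmax over Q-values with a uniform distribution."""
    n = len(q_values)
    soft = softmax(q_values / temperature)
    return (1.0 - eps) * soft + eps * np.ones(n) / n

def linear_q_learning(env, phi, num_features, num_actions,
                      gamma=0.99, eps=0.1, alpha0=0.5, num_steps=100_000):
    """Unmodified linear Q-learning with an epsilon-softmax behavior policy.

    Assumed interfaces (hypothetical, for illustration only):
      - phi(s, a) returns a feature vector of length num_features,
      - env.reset() returns a state, env.step(a) returns (next_state, reward, done).
    The temperature schedule below (proportional to the current weight norm)
    is a stand-in for the paper's adaptive temperature.
    """
    w = np.zeros(num_features)
    s = env.reset()
    for t in range(num_steps):
        alpha = alpha0 / (1 + t) ** 0.75               # decaying step size (assumed schedule)
        q_s = np.array([phi(s, a) @ w for a in range(num_actions)])
        temperature = max(1.0, np.linalg.norm(w))      # adaptive temperature (assumed form)
        probs = eps_softmax_policy(q_s, eps, temperature)
        a = np.random.choice(num_actions, p=probs)     # sample action from behavior policy
        s_next, r, done = env.step(a)
        q_next = max(phi(s_next, a2) @ w for a2 in range(num_actions))
        td_error = r + (0.0 if done else gamma * q_next) - phi(s, a) @ w
        w += alpha * td_error * phi(s, a)              # semi-gradient Q-learning update
        s = env.reset() if done else s_next
    return w
```

Note that the update itself is the standard semi-gradient Q-learning rule; only the behavior policy (epsilon-softmax with an adaptive temperature) is constrained, matching the abstract's statement that no modification is made to the algorithm.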
Lay Summary: Reinforcement learning helps computers learn to make decisions, like choosing moves in games or guiding robots. $Q$-learning is a popular method for finding the best actions. While standard $Q$-learning was shown to settle on good solutions, those proofs needed extra tweaks. For linear $Q$-learning, no one had proven a rate at which it converges to a bounded range; many believed it could spiral out of control. Our research changes that. We are the first to show a rate at which linear $Q$-learning settles into a safe range rather than exploding uncontrollably. For standard $Q$-learning, we prove it finds the best actions under practical conditions, using fewer restrictions. Our new mathematical approach tracks how both methods update decisions in ever-changing scenarios, like a game with shifting rules. This work makes $Q$-learning more trustworthy for real-world tasks, like self-driving cars or smart assistants, where fast, accurate learning is vital. Our findings help developers create AI that learns reliably and quickly, even in tricky, unpredictable situations, paving the way for more effective and dependable technology.
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: linear $Q$-learning, convergence rate, reinforcement learning, the deadly triad
Submission Number: 5190