Optimal Regret Bounds via Low-Rank Structured Variation in Non-Stationary Reinforcement Learning

Published: 18 Sept 2025 · Last Modified: 15 Jan 2026 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Non-Stationary Reinforcement Learning
Abstract: We study reinforcement learning in non-stationary communicating MDPs whose transition drift admits a low-rank plus sparse structure. We propose \textbf{SVUCRL} (Structured Variation UCRL) and prove the dynamic-regret bound \[\widetilde{\mathcal O}\Bigl( \sqrt{SA\,T} + D_{\max}S\sqrt{A\,T} + L_{\max}B_r + D_{\max}L_{\max}B_p + D_{\max}S\sqrt{A\,B_p} + D_{\max}\delta_B B_p + D_{\max}\sqrt{K\,T} \Bigr),\] up to an additional planning-tolerance term $\sum_{t=1}^T \varepsilon_{\tau(m(t))}$, where $S$ is the number of states, $A$ the number of actions, $T$ the horizon, $D_{\max}$ the MDP diameter, $B_r$ and $B_p$ the total reward and transition variation budgets, $K \ll SA$ the rank of the structured drift, and $L_{\max}$ the maximum episode length. The first two terms are the statistical price of learning in stationary problems. The structure-dependent non-stationarity contribution enters through $D_{\max}\sqrt{K\,T}$ (low-rank drift) and $D_{\max}\delta_B B_p$ (sparse shocks), which scale with $\sqrt{K}$ rather than $\sqrt{SA}$ when the drift is low-rank. This matches the $\sqrt{T}$ rate (up to logarithmic factors) and improves on prior $T^{3/4}$-type guarantees. SVUCRL combines: (i) online low-rank tracking with explicit Frobenius-norm guarantees, (ii) incremental robust PCA (RPCA) to separate structured drift from sparse shocks, (iii) adaptive confidence widening via a bias-corrected local-variation estimator, and (iv) factor forecasting with an optimal shrinkage center.
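Illustration (not part of the original abstract): the low-rank plus sparse separation underlying step (ii) can be sketched, for intuition only, with a simple alternating truncated-SVD / soft-thresholding heuristic. This is a hypothetical stand-in rather than SVUCRL's actual incremental RPCA update, and all names below (lowrank_sparse_split, lam, iters) are illustrative assumptions.

# Hypothetical sketch (not the paper's code): split an observed transition-drift
# matrix Delta into a rank-K component L plus a sparse shock component E,
# mirroring the low-rank + sparse structure assumed in the abstract.
import numpy as np

def lowrank_sparse_split(Delta, K, lam=0.1, iters=50):
    """Decompose Delta (an (S*A) x S drift matrix) as Delta ~ L + E,
    with rank(L) <= K and E sparse (entries soft-thresholded at lam)."""
    L = np.zeros_like(Delta)
    E = np.zeros_like(Delta)
    for _ in range(iters):
        # Low-rank step: best rank-K approximation of the residual Delta - E.
        U, s, Vt = np.linalg.svd(Delta - E, full_matrices=False)
        L = (U[:, :K] * s[:K]) @ Vt[:K, :]
        # Sparse step: soft-threshold the remaining residual to isolate shocks.
        R = Delta - L
        E = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    return L, E

# Toy usage: a rank-2 drift plus a few large sparse shocks.
rng = np.random.default_rng(0)
S, A, K = 20, 5, 2
true_L = rng.normal(size=(S * A, K)) @ rng.normal(size=(K, S)) * 0.05
true_E = np.zeros((S * A, S))
idx = rng.choice(true_E.size, size=10, replace=False)
true_E.flat[idx] = rng.normal(scale=1.0, size=10)
L_hat, E_hat = lowrank_sparse_split(true_L + true_E, K=K, lam=0.2)
print(np.linalg.norm(L_hat - true_L) / np.linalg.norm(true_L))  # relative recovery error

The alternating structure (low-rank projection, then elementwise thresholding) is a standard RPCA-style heuristic; the paper's incremental variant and its Frobenius-norm guarantees are described in the submission itself.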
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 21946