Keywords: Non-Stationary Reinforcement Learning
Abstract: We study reinforcement learning in non-stationary communicating MDPs whose
transition drift admits a low-rank plus sparse structure. We propose
\textbf{SVUCRL} (Structured Variation UCRL) and prove the dynamic-regret bound
\[\scalebox{0.75}{$\displaystyle
\widetilde{\mathcal O}\Bigl(
\sqrt{SA\,T}
+ D_{\max}S\sqrt{A\,T}
+ L_{\max}B_r
+ D_{\max}L_{\max}B_p
+ D_{\max}S\sqrt{A\,B_p}
+ D_{\max}\delta_B B_p
+ D_{\max}\sqrt{K\,T}
\Bigr),
$}\]
(up to an additional planning-tolerance term $\sum_{t=1}^T \varepsilon_{\tau(m(t))}$),
where $S$ is the number of states, $A$ the number of actions, $T$ the horizon,
$D_{\max}$ the MDP diameter, $B_r$/$B_p$ the total reward/transition variation
budgets, $K\!\ll\! SA$ the rank of the structured drift, and $L_{\max}$ the maximum episode length. The first two terms are the statistical price of learning in stationary problems.
The structure-dependent non-stationarity contribution appears through
$D_{\max}\sqrt{K\,T}$ (low-rank drift) and $D_{\max}\delta_B B_p$ (sparse shocks),
so the drift-dependent cost scales with $\sqrt{K}$ rather than $\sqrt{SA}$ when the drift is low-rank. The overall bound matches the $\sqrt{T}$ rate (up to
logarithmic factors) and improves on prior $T^{3/4}$-type guarantees. SVUCRL combines:
(i) online low-rank tracking with explicit Frobenius guarantees,
(ii) incremental RPCA to separate structured drift from sparse shocks,
(iii) adaptive confidence widening via a bias-corrected local-variation
estimator, and (iv) factor forecasting with an optimal shrinkage center.
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 21946