Keywords: Non-Stationary Reinforcement Learning
Abstract: We study reinforcement learning in non-stationary communicating MDPs whose
transition drift admits a low-rank plus sparse structure. We propose
SVUCRL (Structured Variation UCRL) and prove the dynamic-regret bound
$
\widetilde{\mathcal O}\bigl(
D_{\max} S \sqrt{AT}
+ D_{\max}\sqrt{(B_r + B_p)\, K S T}
+ D_{\max}\,\delta_B\,B_p
\bigr),
$
where $S$ is the number of states, $A$ the number of actions, $T$ the horizon,
$D_{\max}$ the MDP diameter, $B_r$/$B_p$ the total reward/transition variation
budgets, and $K \le SA$ the rank of the structured drift. The first term is the
statistical price of learning in stationary problems; the second is the
\emph{non-stationarity price}, which scales with $\sqrt{K}$ rather than
$\sqrt{SA}$ when the drift is low-rank. This matches the $\sqrt{T}$ rate (up to
logarithmic factors) and improves on prior $T^{3/4}$-type guarantees.
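To make the $\sqrt{K}$-versus-$\sqrt{SA}$ comparison concrete, here is the non-stationarity price instantiated at the two extremes of the drift rank (a back-of-envelope reading of the bound above; $B := B_r + B_p$ is shorthand we introduce, not notation from the paper):
$
D_{\max}\sqrt{B K S T} =
\begin{cases}
\widetilde{\mathcal O}\bigl(D_{\max}\sqrt{B S T}\bigr), & K = \mathcal O(1)\ \text{(low-rank drift)},\\
\widetilde{\mathcal O}\bigl(D_{\max} S \sqrt{B A T}\bigr), & K = SA\ \text{(unstructured drift)},
\end{cases}
$
so low-rank structure saves a factor of $\sqrt{SA}$ in the variation term.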
(i) online low-rank tracking with explicit Frobenius-norm guarantees;
(ii) incremental robust PCA (RPCA) to separate structured drift from sparse shocks;
(iii) adaptive confidence widening via a bias-corrected local-variation estimator; and
(iv) factor forecasting with an optimal shrinkage center.
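As a rough illustration of the low-rank-plus-sparse separation in step (ii), here is a minimal batch sketch, not SVUCRL's incremental algorithm: the drift matrix `M`, the thresholds `lam`/`tau`, and all function names are our illustrative choices. It runs alternating proximal steps on $\tfrac12\|M-L-E\|_F^2 + \tau\|L\|_* + \lambda\|E\|_1$ to split an observed drift matrix into a low-rank part $L$ and a sparse part $E$.

```python
import numpy as np

def svd_shrink(M, tau):
    # Singular-value soft-thresholding: the prox of tau * (nuclear norm).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(M, lam):
    # Entrywise soft-thresholding: the prox of lam * (l1 norm).
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

def rpca_split(M, lam=None, tau=1.0, iters=200):
    # Alternating proximal minimization of
    #   0.5 * ||M - L - E||_F^2 + tau * ||L||_* + lam * ||E||_1.
    # Each row of M could hold, e.g., the vectorized change in
    # P(. | s, a) between consecutive phases (illustrative setup only).
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))  # common PCP-style default
    L = np.zeros_like(M)
    E = np.zeros_like(M)
    for _ in range(iters):
        E = soft_threshold(M - L, lam)  # absorb sparse shocks
        L = svd_shrink(M - E, tau)      # absorb structured drift
    return L, E

# Toy check: rank-4 drift plus a few large shocks.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 40))
M[rng.random(M.shape) < 0.02] += 5.0
L, E = rpca_split(M)
print(np.linalg.matrix_rank(L, tol=1e-3), int(np.count_nonzero(E)))
```

Singular-value soft-thresholding is the proximal operator of the nuclear norm, so each pass is exact block-coordinate descent on the penalized objective; an incremental variant would update the SVD factors online rather than recomputing them each round.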
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 21946