Regret Is Not Enough: Teaching and Stability in Non-Stationary Reinforcement Learning

TMLR Paper 6627 Authors

24 Nov 2025 (modified: 10 Dec 2025) · Under review for TMLR · CC BY 4.0
Abstract: Standard treatments of non-stationary reinforcement learning cast it as a tracking problem, tacitly accepting any policy that keeps pace with a drifting optimum and relegating instability to a minor algorithmic concern. Yet in safety-critical, value-laden domains, decisions answer to external stakeholders, and the central question is not only how quickly the learner tracks non-stationarity, but whether it can be taught under drift without sacrificing performance or stability. We formalize this question as the \emph{Teaching--Regret--Stability (TRS) Principle} for \emph{Teachable Non-stationary RL (TNRL)}. Under standard variation-budget assumptions and a Lipschitz policy-update condition, we prove a high-level theorem showing that a bounded-budget teacher can simultaneously drive the teaching error to an arbitrarily small target, keep dynamic regret sublinear, and ensure that the policy sequence remains stable on average.
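The abstract states the three TRS quantities only informally; below is a minimal sketch of how they are commonly formalized in the non-stationary RL literature, using assumed notation ($\pi_t$, $\pi_t^{*}$, $\pi_t^{\dagger}$, $B_T$, etc.) that need not match the paper's own definitions.

```latex
% Hedged sketch; symbols are assumptions, not the paper's notation.
% \pi_t: learner's policy, \pi_t^{*}: drifting optimal policy, \pi_t^{\dagger}: teacher's target.
\[
  \underbrace{\sum_{t=1}^{T} \Big( V_t^{\pi_t^{*}}(s_t) - V_t^{\pi_t}(s_t) \Big) = o(T)}_{\text{sublinear dynamic regret}},
  \qquad
  \underbrace{\frac{1}{T}\sum_{t=1}^{T} d\big(\pi_t, \pi_t^{\dagger}\big) \le \varepsilon}_{\text{teaching error}},
  \qquad
  \underbrace{\frac{1}{T}\sum_{t=1}^{T-1} \big\lVert \pi_{t+1} - \pi_t \big\rVert \le \kappa}_{\text{average stability}}
\]
% assumed to hold under a variation budget \sum_{t=2}^{T} \Delta_t \le B_T on the drifting MDP,
% a Lipschitz update \lVert \pi_{t+1} - \pi_t \rVert \le L \lVert \theta_{t+1} - \theta_t \rVert,
% and a teacher restricted to a bounded intervention budget.
```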
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Michael_Bowling1
Submission Number: 6627