The Teaching--Regret--Stability Principle in Non-Stationary Reinforcement Learning

TMLR Paper 6627 Authors

24 Nov 2025 (modified: 26 Feb 2026) · Withdrawn by Authors · CC BY 4.0
Abstract: Standard treatments of non-stationary reinforcement learning primarily emphasize tracking and evaluate performance via dynamic regret under variation-budget drift. In many deployments, however, practitioners may also care about which policy is learned (e.g., compliance/safety targets) and how smoothly it evolves over time. This motivates studying teaching to a target policy and policy-trajectory stability alongside regret, as complementary objectives rather than replacements. We formalize this viewpoint in the Teaching--Regret--Stability (TRS) Principle for Teachable Non-stationary RL (TNRL). Under standard variation-budget assumptions and a Lipschitz policy-update condition, we prove a high-level theorem showing that a bounded-budget teacher can simultaneously drive the teaching error to an arbitrarily small target, keep dynamic regret sublinear, and ensure that the policy sequence remains stable on average.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission:

### Revisions

* Added Supplementary Material.
* Fixed an incorrect cross-reference: the time-varying target extension is now correctly pointed to **Sec. (Conclusion and Limitations)** instead of a non-existent "Discussion" section.
* **Appendix proofs:** Added complete proofs for **Lemmas 1–3**.
* **TRS Principle:** Provided a **formal statement** of the TRS Principle and clarified what parameters induce the **trade-off / feasibility frontier**.
* **Teacher strategy (Sec. 4.2):** Stated the **exact poisoning/teaching policy** used in experiments (pseudocode + cost accounting).
* **Figures / uncertainty:** Clarified that error bars are **mean ± std over seeds**, and added a paired frontier to provide paired robustness evidence.
* **Framing + scope:** Toned down rhetoric ("structural failure/disaster"), clarified our stability notion, and qualified bandit-only claims.
* **Reward hypothesis vs. multi-criteria framing:** We clearly position TRS as a multi-criteria lens that complements reward-based evaluation.
* **"Trade-off" language in the TRS principle:** We describe Theorem 1 as providing simultaneous guarantees rather than a formal "trade-off".
* **Restrictiveness of the setting:** We added a Limitations paragraph noting that our result is a first step bridging non-stationary RL (variation budgets).
* **Title:** We changed the title to better reflect the core contribution.
* **Discount factor:** We relaxed the convention to allow $\gamma \in [0,1]$ (including $\gamma = 0$), and clarified that for horizon-one contextual bandits $\gamma$ is immaterial.
Assigned Action Editor: ~Michael_Bowling1
Submission Number: 6627