Abstract: Standard treatments of non-stationary reinforcement learning primarily emphasize tracking and evaluate performance via dynamic regret under variation-budget drift. In many deployments, however, practitioners may also care about which policy is learned (e.g., compliance/safety targets) and how smoothly it evolves over time. This motivates studying teaching to a target policy and policy-trajectory stability alongside regret, as complementary objectives rather than replacements. We formalize this viewpoint in the Teaching--Regret--Stability (TRS) Principle for Teachable Non-stationary RL (TNRL). Under standard variation-budget assumptions and a Lipschitz policy-update condition, we prove a high-level theorem showing that a bounded-budget teacher can simultaneously drive the teaching error to an arbitrarily small target, keep dynamic regret sublinear, and ensure that the policy sequence remains stable on average.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: ### Revisions
* Added Supplementary Material.
* Fixed an incorrect cross-reference: the time-varying target extension is now correctly pointed to **Sec. (Conclusion and Limitations)** instead of a non-existent “Discussion” section.
* **Appendix proofs:** Added complete proofs for **Lemmas 1–3**.
* **TRS Principle:** Provide a **formal statement** of the TRS Principle and clarify what parameters induce the **trade-off / feasibility frontier**.
* **Teacher strategy (Sec. 4.2):** State the **exact poisoning/teaching policy** used in experiments (pseudocode + cost accounting).
* **Figures / uncertainty:** Clarify that error bars are **mean ± std over seeds**, and add a paired frontier plot to provide robustness evidence.
* **Framing + scope:** Tone down rhetoric (“structural failure/disaster”), clarify our stability notion, and qualify bandit-only claims.
* **Reward hypothesis vs. multi-criteria framing:** We clearly position TRS as a multi-criteria lens that complements reward-based evaluation.
* **“Trade-off” language in the TRS principle:** We describe Theorem 1 as providing simultaneous guarantees rather than a formal “trade-off”.
* **Restrictiveness of the setting:** We add a Limitations paragraph noting that our result is a first step toward bridging non-stationary RL (variation budgets) with policy teaching.
* **Title:** We change the title to better reflect the core contribution.
* **Discount factor:** We relax the convention to allow $\gamma \in [0,1]$ (including $\gamma=0$), and clarify that for horizon-one contextual bandits $\gamma$ is immaterial.
Assigned Action Editor: ~Michael_Bowling1
Submission Number: 6627