On-Policy Model Error Suffices: An Invariant-Measure Return-Gap Bound for Model-Based Reinforcement Learning
Abstract: We study the discounted return gap between a fixed policy evaluated on a true dynamical system and on a learned closed-loop model. Lipschitz-based bounds in the model-based reinforcement learning literature control this gap in terms of the \emph{supremum} of the one-step model error over the state space, amplified by the global closed-loop Lipschitz constant; this is pessimistic for systems whose closed-loop trajectories concentrate on a low-dimensional attractor. We prove a return-gap bound whose dominant term is the one-step model error \emph{averaged under the invariant measure of the true closed loop}, amplified by a trajectory-localized linearized contraction rate, plus geometrically decaying transients. The bound recovers the classical sup-norm bound as a limiting case; its leading term is strictly smaller whenever the invariant-measure-averaged error \(\bar\epsilon_\mu\) is strictly below the global supremum error \(\epsilon_0\), as occurs when large model errors lie off the closed-loop attractor. We exhibit a regime in which this distinction is qualitative: the classical bound is infinite while ours is finite. As a consequence, the empirical on-policy mean-squared error minimized by modern world-model algorithms upper-bounds (up to a square root and a finite-sample concentration term) the return-gap-controlling quantity, giving the training objective an explicit return-gap interpretation. We extend the result to stochastic dynamics via a Wasserstein-\(1\) coupling, and prove a matching bound on the Wasserstein distance between the true and learned-model invariant measures.
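As a rough schematic of the bound's claimed structure (an illustrative rendering of the abstract only, not the paper's theorem statement; the amplification factor \(\kappa\), constant \(C\), and transient horizon \(t_0\) are notation assumed here for exposition):
\[
  % Illustrative sketch: \kappa, C, t_0 are assumed notation, not the paper's constants.
  \bigl|\, J_{\mathrm{true}}(\pi) - J_{\mathrm{model}}(\pi) \,\bigr|
  \;\lesssim\;
  \underbrace{\frac{\kappa\,\bar\epsilon_\mu}{1-\gamma}}_{\text{error averaged under the true closed-loop invariant measure}}
  \;+\;
  \underbrace{C\,\gamma^{t_0}}_{\text{geometrically decaying transient}},
\]
where, per the abstract, the classical sup-norm bound is recovered in the limiting case in which \(\bar\epsilon_\mu\) is replaced by the supremum error \(\epsilon_0\) and the trajectory-localized rate \(\kappa\) by the global closed-loop Lipschitz amplification.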
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Steffen_Udluft1
Submission Number: 8614