Robust Reward Sequence Modeling with Multi-Scale Consistency for Model-Based Reinforcement Learning

ICLR 2026 Conference Submission 19184 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: model-based reinforcement learning, state-space model, sequence modeling
Abstract: We propose a novel framework for reliable reward modeling in model-based reinforcement learning, built on top of Mamba-based sequence models. Prior work decodes only the immediate reward from the latent dynamics and therefore accumulates error over long rollouts; our approach instead trains an ensemble of reward heads, each predicting the cumulative return over a different horizon. To tie these predictions together, we introduce a cross-horizon consistency regularizer that encourages the difference between any two heads' predictions to match the prediction of the head spanning their gap. We further add a chunk-level reward model that summarizes rewards over non-overlapping blocks, and we enforce consistency between chunk-level and per-step predictions for smoother estimates. During imagination, we dynamically select the reward heads with the lowest predictive uncertainty to guide policy rollouts, and we combine these multi-scale predictions with the standard $\lambda$-return during value estimation, so that the most accurate, well-conditioned reward estimates directly shape policy learning. We integrate our method into Drama, a state-of-the-art Mamba-enabled model-based agent, and evaluate it on the \emph{Atari 100k} benchmark. Compared to the single-head baseline, our multi-scale, cross-horizon consistency approach reduces reward prediction error by $47\%$ on average and yields higher or comparable game scores across the suite. These results demonstrate that explicitly modeling and regularizing rewards at multiple temporal scales, and drawing on the most confident predictions, improves both the fidelity of imagined rollouts and the resulting policy.
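To make the cross-horizon constraint concrete, the following is a minimal sketch of one possible form of the consistency objective; the notation (per-horizon heads $\hat{R}_h$ over a latent rollout $z_t$, a squared-error penalty) is an illustrative assumption rather than the paper's exact formulation. For any pair of horizons $h_i < h_j$, the gap head $\hat{R}_{h_j - h_i}$, evaluated $h_i$ steps later in the rollout, should account for the difference between the two heads' predictions:

$$\mathcal{L}_{\text{consist}} = \sum_{h_i < h_j} \Big( \hat{R}_{h_j}(z_t) - \hat{R}_{h_i}(z_t) - \hat{R}_{h_j - h_i}(z_{t+h_i}) \Big)^2 .$$

The $\lambda$-return mentioned in the abstract is the standard recursive form used in Dreamer-style agents,

$$G_t^{\lambda} = r_t + \gamma \Big[ (1 - \lambda)\, v(z_{t+1}) + \lambda\, G_{t+1}^{\lambda} \Big],$$

with the selected multi-scale reward estimates presumably supplying $r_t$ during imagined rollouts; how exactly the two are combined is specified in the paper itself.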
Primary Area: reinforcement learning
Submission Number: 19184