Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

Tianwei Ni; Esther Derman; Vineet Jain; Vincent Taboga; Siamak Ravanbakhsh; Pierre-Luc Bacon

16 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0

Keywords: Offline RL, Model-based RL, Bayesian RL, Partially Observable RL, Deep RL

TL;DR: We present a Bayesian-principled offline RL algorithm that succeeds with adaptive long-horizon planning, without conservatism.

Abstract: Popular offline reinforcement learning (RL) methods rely on *conservatism*, either by penalizing out-of-dataset actions or by restricting planning horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a *Bayesian* perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale the principle to realistic tasks, identifying key design choices, such as layer normalization in the world model and adaptive long-horizon planning, that mitigate compounding error and value overestimation. These yield our practical algorithm, Neubay, grounded in the *neu*tral *Bay*esian principle. On D4RL and NeoRL benchmarks, Neubay generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with planning horizons of several hundred steps, challenging common belief. Finally, we characterize when Neubay is preferable to conservatism, laying the foundation for a new direction in offline and model-based RL.
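Illustrative sketch (not from the paper): the contrast the abstract draws between conservatism and the Bayesian principle can be made concrete in a toy two-armed Bernoulli bandit. The code below is a minimal sketch under stated assumptions: the offline dataset, the Beta(1, 1) prior, the `penalty` coefficient, and all function names are hypothetical and chosen only for illustration. It does not reproduce Neubay or the paper's bandit experiment. A conservative rule scores each arm by posterior mean minus a multiple of posterior standard deviation, while a neutral Bayesian rule maximizes posterior expected reward and keeps updating its posterior from the rewards it observes at test time.

```python
import math

def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Beta(a, b) posterior for a Bernoulli arm under an assumed Beta prior."""
    return (prior_a + successes, prior_b + failures)

def beta_mean(a, b):
    return a / (a + b)

def beta_std(a, b):
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def conservative_choice(posteriors, penalty=2.0):
    """Pick the arm with the best pessimistic score: mean - penalty * std."""
    scores = [beta_mean(a, b) - penalty * beta_std(a, b) for a, b in posteriors]
    return max(range(len(scores)), key=scores.__getitem__)

def bayesian_choice(posteriors):
    """Pick the arm with the highest posterior mean reward (no penalty)."""
    means = [beta_mean(a, b) for a, b in posteriors]
    return max(range(len(means)), key=means.__getitem__)

def update(posterior, reward):
    """History-dependent step: fold one observed 0/1 reward into the posterior."""
    a, b = posterior
    return (a + reward, b + 1 - reward)

# Hypothetical low-coverage dataset: (successes, failures) per arm.
# Arm 0 is well covered but mediocre; arm 1 was tried only twice.
dataset = [(30, 70), (1, 1)]
posteriors = [beta_posterior(s, f) for s, f in dataset]

conservative_arm = conservative_choice(posteriors)
bayesian_arm = bayesian_choice(posteriors)
```

With this hypothetical dataset the two rules disagree: the pessimistic score steers the conservative rule to the well-covered arm 0, while the Bayesian rule tries the uncertain arm 1 and can revise that choice through `update` as test-time rewards arrive.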

Supplementary Material: zip

Primary Area: reinforcement learning

Submission Number: 6761
