MOBA: Model-Based Offline Reinforcement Learning with Adaptive Contextual Penalties

ICLR 2026 Conference Submission 18872 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Offline reinforcement learning, Adaptive Contextual Penalty
Abstract: Mainstream model-based offline reinforcement learning, which aims to learn effective policies from static datasets, typically employs conservatism to keep policies from exploring out-of-support regions; for example, MOPO penalizes rewards with an uncertainty measure derived from next-state prediction. Prior context-based methods instead draw on meta-learning: they infer latent dynamics patterns from experience so that the policy can adapt its behavior in out-of-support regions at deployment time, offering the potential for more robust decisions in those regions and for outperforming traditional model-based methods. However, current adaptive policy learning methods still rely on traditional conservative penalties to mitigate the model's compounding error, which overly constrains policy exploration. In this paper, we propose Model-Based Offline Reinforcement Learning with Adaptive Contextual Penalty (MOBA), which introduces a context-aware penalty adaptation mechanism that dynamically adjusts conservatism based on trajectory history. Theoretically, we prove that MOBA maximizes a tighter lower bound on the true return than prior methods such as MOPO, achieving an optimal trade-off between risk and generalization. Empirically, we demonstrate that MOBA outperforms state-of-the-art model-based and model-free approaches on the NeoRL and D4RL benchmarks. Our results highlight the importance of adaptive uncertainty estimation in model-based offline RL.
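To make the abstract's mechanism concrete, below is a minimal, hypothetical sketch of a context-aware penalty in the MOPO style, assuming the penalized reward takes the form r̃ = r − λ(c) · u(s, a), where λ is produced by a recurrent encoder over a window of recent transitions rather than being a fixed hyperparameter. The class names (ContextEncoder, AdaptivePenalty) and the GRU-based encoder are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): a MOPO-style penalized reward
# whose coefficient lambda is predicted from trajectory history instead of
# being a fixed constant.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextEncoder(nn.Module):
    """Encodes a window of (s, a, r, s') transitions into a latent context."""

    def __init__(self, transition_dim: int, context_dim: int = 16):
        super().__init__()
        self.gru = nn.GRU(transition_dim, context_dim, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, window, transition_dim) -> context: (batch, context_dim)
        _, h = self.gru(history)
        return h.squeeze(0)


class AdaptivePenalty(nn.Module):
    """Maps the latent context to a non-negative penalty coefficient lambda."""

    def __init__(self, context_dim: int = 16):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(context_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # softplus keeps lambda >= 0 so the penalty never becomes a bonus
        return F.softplus(self.head(context))


def penalized_reward(reward, uncertainty, history, encoder, penalty):
    """Reward shaping r_tilde = r - lambda(c) * u(s, a), with lambda
    conditioned on the encoded trajectory history."""
    lam = penalty(encoder(history))              # (batch, 1)
    return reward - lam.squeeze(-1) * uncertainty
```

Under this reading, the design choice is that regions whose recent trajectory context resembles the offline data receive a small λ (less conservatism), while unfamiliar contexts receive a larger λ, which is how the penalty "dynamically adjusts conservatism based on trajectory history."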
Primary Area: reinforcement learning
Submission Number: 18872