Towards Optimism-Pessimism Trade-off in Model-based Offline-to-Online Reinforcement Learning

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning
Abstract: Model-based offline-to-online reinforcement learning (RL) provides a sample-efficient framework by pre-training environment models and control policies on offline data, followed by fine-tuning through limited online interactions. However, the distribution shift between the offline and online stages often hinders fine-tuning performance. Existing methods address this problem by adjusting the trade-off between optimism and pessimism within a single-objective formulation, which requires online evaluation across tasks and results in an expensive bi-level optimization procedure. In this work, we identify the optimism-pessimism trade-off during offline training as a key challenge: optimistic policies tend to generalize better to novel online tasks by exploring out-of-distribution states and actions, whereas pessimistic policies remain constrained to the offline data distribution and perform better on tasks similar to the offline ones. To address this challenge, we propose a bi-objective formulation that captures this trade-off and yields a pool of Pareto policies during offline training. These policies realize different levels of the trade-off, enabling flexible policy selection for various online tasks. To produce them, we introduce Multiple-Objective Soft Actor-critIC (MOSAIC), which solves multiple bi-objective optimization problems guided by reference vectors and refines the Pareto policy pool through neighborhood search. After offline training, a contextual bandit algorithm hierarchically selects the most suitable policy for fine-tuning at each online interaction step. Empirically, our pipeline, **Hi**erarchical **P**areto **P**olicy **P**ool (**HiP3**), achieves state-of-the-art performance on offline-to-online RL benchmarks with diverse online tasks, and comprehensive ablation studies further elucidate the mechanisms behind HiP3.
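The abstract's reference-vector-guided bi-objective optimization can be illustrated with a standard weighted-Chebyshev scalarization, a common way to turn a vector of objectives and a reference vector into a single score so that each reference vector induces one point on the Pareto front. This is a minimal sketch of that general technique, not the paper's actual MOSAIC algorithm; all function and variable names here are hypothetical.

```python
import numpy as np

def chebyshev_scalarize(objectives, ref_vector, ideal):
    """Weighted-Chebyshev scalarization of a multi-objective value.

    Maps a vector of objective values (e.g. an optimistic and a pessimistic
    return estimate) to a single score: the weighted max-distance to an
    ideal point. Minimizing this score for different reference vectors
    yields different Pareto-optimal trade-offs. Illustrative only.
    """
    objectives = np.asarray(objectives, dtype=float)
    ref_vector = np.asarray(ref_vector, dtype=float)
    ideal = np.asarray(ideal, dtype=float)
    # Smaller is better: weighted worst-case gap to the ideal point.
    return float(np.max(ref_vector * np.abs(ideal - objectives)))

# A fan of reference vectors would induce a pool of trade-off policies,
# one per vector, spanning optimistic-to-pessimistic behavior:
refs = [(w, 1.0 - w) for w in np.linspace(0.1, 0.9, 5)]
scores = [chebyshev_scalarize([0.7, 0.4], r, ideal=[1.0, 1.0]) for r in refs]
```

Each entry of `refs` weights the two objectives differently, so an offline trainer optimizing one scalarized score per vector would obtain a pool of policies at varying optimism-pessimism levels.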
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 12038