Preference-based Policy Optimization from Sparse-reward Offline Dataset

ICLR 2026 Conference Submission 10578 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026, ICLR 2026, CC BY 4.0
Keywords: Reinforcement Learning, Offline Reinforcement Learning, Preference-based Reinforcement Learning
TL;DR: We introduce a contrastive framework that mitigates value overestimation in offline RL by training a policy to prefer successful trajectories over both observed and synthetic failures, leading to state-of-the-art results.
Abstract: Offline reinforcement learning (RL) holds the promise of training effective policies from static datasets without costly online interaction. However, offline RL faces key limitations, most notably the challenge of generalizing to unseen or infrequently encountered state-action pairs. When a value function is learned from limited data in sparse-reward environments, it can become overly optimistic about poorly represented regions of the state-action space, leading to unreliable value estimates and degraded policy quality. To address these challenges, we introduce a novel approach based on contrastive preference learning that bypasses direct value function estimation. Our method trains policies by contrasting successful demonstrations with failure behaviors present in the dataset, as well as with synthetic behaviors generated outside the support of the data distribution. This contrastive formulation mitigates overestimation bias and improves robustness in offline learning. Empirical results on challenging sparse-reward offline RL benchmarks show that our method substantially outperforms state-of-the-art baselines in both learning efficiency and final performance.
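To make the contrastive idea in the abstract concrete, the sketch below shows one common way such an objective can be written: a Bradley-Terry-style preference loss that scores trajectory segments by the policy's action log-probabilities (no value function is learned) and pushes successful segments above both observed and synthetic failures. All names (`policy.log_prob`, `segment_score`, the batch layout, and the synthetic-failure source) are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a contrastive preference objective over trajectory segments,
# assuming a policy object that exposes per-step action log-probabilities.
import torch
import torch.nn.functional as F

def segment_score(policy, obs, act, alpha=0.1):
    # Score a segment by the temperature-scaled sum of action log-probabilities
    # under the current policy; no value function is estimated.
    log_probs = policy.log_prob(obs, act)        # shape: (batch, T)
    return alpha * log_probs.sum(dim=-1)         # shape: (batch,)

def preference_loss(policy, success_seg, failure_seg):
    # Bradley-Terry loss: the policy should assign a higher score to successful
    # segments than to failure segments.
    s_pos = segment_score(policy, *success_seg)
    s_neg = segment_score(policy, *failure_seg)
    return -F.logsigmoid(s_pos - s_neg).mean()

def training_step(policy, optimizer, batch, synthetic_failures):
    # Contrast successes against both dataset failures and synthetic,
    # out-of-support failures (e.g., dataset states with perturbed actions).
    loss = (preference_loss(policy, batch["success"], batch["failure"])
            + preference_loss(policy, batch["success"], synthetic_failures))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because out-of-support behaviors only ever appear on the "dispreferred" side of the loss, the policy is discouraged from assigning them high probability, which is one way to see how this formulation avoids the optimistic value estimates the abstract describes.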
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10578