Similarity as Reward Alignment: Robust and Versatile Preference-based Reinforcement Learning

ICLR 2026 Conference Submission 19619 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Reinforcement Learning, preferences, RLHF, contrastive learning
Abstract: Preference-based Reinforcement Learning (PbRL) encompasses a variety of approaches for aligning models with human intent and alleviating the burden of reward engineering. However, most prior PbRL work has not investigated robustness to labeler errors, which are inevitable when labelers are non-experts or operate under time constraints. We introduce Similarity as Reward Alignment (SARA), a simple contrastive framework that is both resilient to noisy labels and adaptable to diverse feedback formats. SARA learns a latent representation of preferred samples and computes rewards as similarities to the learned latent. On preference data with varying realistic noise rates, we demonstrate strong and consistent performance on continuous-control offline RL benchmarks, while baselines often degrade severely with noise. We further demonstrate SARA's versatility in applications such as cross-task preference transfer and reward shaping in online learning.
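To make the reward-as-similarity idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch: an encoder maps trajectory segments to a latent space, preferred samples are aggregated into a prototype latent, and a segment's reward is its cosine similarity to that prototype. The encoder architecture, the mean aggregation, and all names (`SegmentEncoder`, `preferred_prototype`, `similarity_reward`) are illustrative assumptions, not the authors' implementation or training objective.

```python
import torch
import torch.nn.functional as F

class SegmentEncoder(torch.nn.Module):
    """Maps a flattened (state, action) segment to a unit-norm latent vector."""
    def __init__(self, input_dim: int, latent_dim: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(input_dim, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def preferred_prototype(encoder: SegmentEncoder,
                        preferred_segments: torch.Tensor) -> torch.Tensor:
    """Aggregate preferred samples into one latent (here: a normalized mean).

    preferred_segments: (N, input_dim) batch of preferred segments.
    """
    with torch.no_grad():
        z = encoder(preferred_segments)            # (N, latent_dim)
    return F.normalize(z.mean(dim=0), dim=-1)      # (latent_dim,)


def similarity_reward(encoder: SegmentEncoder,
                      segments: torch.Tensor,
                      prototype: torch.Tensor) -> torch.Tensor:
    """Reward = cosine similarity between each segment's latent and the prototype."""
    with torch.no_grad():
        z = encoder(segments)                      # (B, latent_dim)
    return z @ prototype                           # (B,) rewards in [-1, 1]
```

In this sketch the rewards could then relabel an offline dataset before running any standard offline RL algorithm; how the encoder is actually trained (e.g., the contrastive objective on noisy preference pairs) is the substance of the paper and is not reproduced here.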
Primary Area: reinforcement learning
Submission Number: 19619