Keywords: Alignment, LLM, RLHF
TL;DR: We introduce a new robust RLHF algorithm that explicitly acknowledges the potential noise in a reward model
Abstract: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm for aligning large language models (LLMs) with human preferences. Yet existing RLHF relies heavily on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g., human labeling errors, making the pipeline fragile. In this work, we formulate the problem of performing robust RLHF with noisy reward models. Our goal is to design robust RLHF algorithms that explicitly acknowledge the potential noise in a reward model. Our first contribution is an analysis revealing that a certain transformation of the preference function improves its robustness to noise in the reward function. This observation leads to a new reward function design that involves two steps: (1) an offline sampling step that obtains responses to prompts, which serve as a baseline, and (2) a contrastive reward computed against these baseline responses and used in Proximal Policy Optimization (PPO). We show that our proposed rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO. We also empirically demonstrate that contrastive rewards can substantially improve RLHF, as evaluated by both GPT models and humans, and that they consistently outperform strong baselines.
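The abstract outlines a two-step contrastive reward. Below is a minimal Python sketch of that idea under stated assumptions: all names (`reward_model`, `sample_responses`, the mean-over-samples baseline) are hypothetical illustrations, not the paper's actual implementation, which may aggregate baseline scores differently or plug the signal into PPO in another way.

```python
from typing import Callable, Dict, List

# Step 1 (offline): sample baseline responses for each prompt and score them
# with the (possibly noisy) reward model, producing a per-prompt baseline.
def build_baseline_rewards(
    prompts: List[str],
    sample_responses: Callable[[str], List[str]],   # e.g., k samples from the SFT policy (assumption)
    reward_model: Callable[[str, str], float],
) -> Dict[str, float]:
    baselines = {}
    for prompt in prompts:
        responses = sample_responses(prompt)
        scores = [reward_model(prompt, r) for r in responses]
        baselines[prompt] = sum(scores) / len(scores)  # mean baseline reward (one possible choice)
    return baselines

# Step 2 (online, inside the PPO loop): reward each rollout relative to its
# prompt's offline baseline, so the policy is rewarded for improving over it.
def contrastive_reward(
    prompt: str,
    response: str,
    reward_model: Callable[[str, str], float],
    baselines: Dict[str, float],
) -> float:
    return reward_model(prompt, response) - baselines[prompt]
```

In this sketch, subtracting a per-prompt baseline centers the reward signal, which is one way to see how such a design could calibrate to task difficulty and reduce variance in PPO, as the abstract claims.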
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8961