Keywords: Alignment, LLM, RLHF
TL;DR: We introduce a new robust RLHF algorithm that explicitly acknowledges the potential noise in a reward model
Abstract: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm for aligning large language models (LLMs) with human preferences. Yet existing RLHF relies heavily on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g., human labeling errors, making the pipeline fragile. In this work, we formulate the problem of performing robust RLHF with noisy reward models. Our goal is to design robust RLHF algorithms that explicitly acknowledge the potential noise in a reward model. Our first contribution is an analysis revealing that a certain transformation of the preference function improves its robustness to noise in the reward function. This observation leads to a new reward function design that involves two steps: (1) an offline sampling step that obtains responses to prompts, which serve as a baseline, and (2) a contrastive reward computed against these baseline responses and used in Proximal Policy Optimization (PPO). We show that our proposed rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO. We also empirically demonstrate that contrastive rewards can substantially improve RLHF, as evaluated by both GPT models and humans, and that they consistently outperform strong baselines.
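The abstract outlines a two-step contrastive reward. Below is a minimal Python sketch of that idea under stated assumptions: all names (`reward_model`, `sample_responses`, the mean-over-samples baseline) are hypothetical illustrations, not the paper's actual implementation, which may aggregate baseline scores differently or plug the signal into PPO in another way.

```python
from typing import Callable, Dict, List

# Step 1 (offline): sample baseline responses for each prompt and score them
# with the (possibly noisy) reward model, producing a per-prompt baseline.
def build_baseline_rewards(
    prompts: List[str],
    sample_responses: Callable[[str], List[str]],   # e.g., k samples from the SFT policy (assumption)
    reward_model: Callable[[str, str], float],
) -> Dict[str, float]:
    baselines = {}
    for prompt in prompts:
        responses = sample_responses(prompt)
        scores = [reward_model(prompt, r) for r in responses]
        baselines[prompt] = sum(scores) / len(scores)  # mean baseline reward (one possible choice)
    return baselines

# Step 2 (online, inside the PPO loop): reward each rollout relative to its
# prompt's offline baseline, so the policy is rewarded for improving over it.
def contrastive_reward(
    prompt: str,
    response: str,
    reward_model: Callable[[str, str], float],
    baselines: Dict[str, float],
) -> float:
    return reward_model(prompt, response) - baselines[prompt]
```

In this sketch, subtracting a per-prompt baseline centers the reward signal, which is one way to see how such a design could calibrate to task difficulty and reduce variance in PPO, as the abstract claims.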
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8961