Keywords: Reinforcement Learning from Human Feedback, Reward Modeling
Abstract: Reinforcement Learning from Human Feedback (RLHF) improves the alignment of large language models (LLMs) with human preferences, with Reward Models (RMs) playing a pivotal role. Both RLHF and sampling techniques such as Best-of-N require RMs to provide reliable rewards to guide policy training or sample selection. However, despite the advancement of LLMs, critical issues in RMs persist, such as overestimation on out-of-distribution (OOD) data (also known as reward hacking) and a preference for verbose outputs (length bias). These issues undermine the reliability of RM-generated rewards. Training an unbiased RM requires addressing these challenges, yet in-depth analyses of RMs remain scarce. In this paper, we first decompose the RM training pipeline and identify three aspects critical for developing an unbiased RM: 1) model architecture, 2) training paradigm, and 3) the influence of preference data. For each aspect, we conduct thorough empirical studies that reveal several insightful design considerations. Building on our findings, we develop an RM that mitigates the identified issues. This study is the first comprehensive, holistic examination of the challenges in RM training, offering in-depth analyses of essential concerns and providing guidance for training unbiased RMs that can accurately guide downstream policies. The relevant code and models will be made publicly available.
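For context on how an RM guides sample selection in Best-of-N, a minimal sketch is shown below; `policy.generate` and `reward_model.score` are hypothetical interfaces (not from the paper), standing in for any generator and any scalar-reward RM.

```python
def best_of_n(prompt: str, policy, reward_model, n: int = 8) -> str:
    """Best-of-N sampling: draw n candidate responses from the policy and
    return the one the reward model scores highest. If the RM overestimates
    rewards on OOD or verbose outputs, this selection step inherits that bias.
    """
    candidates = [policy.generate(prompt) for _ in range(n)]       # sample n responses
    rewards = [reward_model.score(prompt, c) for c in candidates]  # one scalar reward per response
    best_idx = max(range(n), key=lambda i: rewards[i])             # index of the highest-reward sample
    return candidates[best_idx]
```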
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11237