Reliability-Aware Preference Learning for LLM Reward Models

28 Sept 2024 (modified: 17 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: preference learning, RLHF, human models, scalable oversight
TL;DR: By explicitly accounting for when humans give unreliable feedback, we can learn reward functions that better align with human values.
Abstract: Reward functions learned from human feedback serve as the training objective for RLHF, the current state-of-the-art approach for aligning large language models with our values. However, in practice, these reward models fail to robustly capture our desiderata, often attributing more value to features such as output length or agreement with the user and less value to important features like factual correctness. A major reason is that human annotators provide feedback that is an unreliable reflection of their true preferences because of knowledge gaps, limited resources, cognitive biases, or other factors. We focus on making preference learning robust to unreliable feedback by explicitly modeling the knowledge and judgment of annotators. In particular, we estimate reliability scores for each provided pairwise comparison and incorporate them into the implicit human model used in RLHF, DPO, and other alignment techniques, an approach we call Reliability-Aware Preference Learning (RAPL). To test our approach, we introduce the Length Incentivized Evaluations (LIE) dataset as a setting in which annotators are particularly likely to provide unreliable feedback. Then, we curate the Testing Reasoning and Understanding Errors dataset for training models to predict reliability scores. We find that traditional preference learning on the LIE dataset and other commonly used RLHF datasets leads to models that place far more weight on output length than accuracy. In contrast, RAPL results in models that better capture the true values of annotators.
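To make the core idea concrete, below is a minimal sketch of one way reliability scores could enter the implicit human model: weighting each pairwise comparison's Bradley-Terry log-likelihood by its estimated reliability, so that comparisons judged unreliable contribute less to the reward-model update. The function names, the weighting scheme, and the example tensors are illustrative assumptions for exposition, not the paper's actual formulation or code.

```python
import torch
import torch.nn.functional as F

def bt_loss(chosen_rewards, rejected_rewards):
    """Standard Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards)

def rapl_loss(chosen_rewards, rejected_rewards, reliability):
    """Hypothetical reliability-weighted variant: comparisons with low estimated
    reliability are down-weighted in the training signal."""
    per_pair = bt_loss(chosen_rewards, rejected_rewards)
    return (reliability * per_pair).mean()

# Toy batch of 3 comparisons with made-up reward-model outputs and reliability scores.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, 1.5])
reliability = torch.tensor([0.9, 0.2, 0.7])  # estimated per-comparison reliability
print(rapl_loss(chosen, rejected, reliability))
```

Other instantiations are possible (e.g., treating reliability as the probability that the recorded preference label is correct), but the weighting view above is enough to see how unreliable comparisons can be prevented from dominating features like output length.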
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12607