Scalable Oversight by Accounting for Unreliable Feedback

Published: 17 Jun 2024, Last Modified: 02 Jul 2024, ICML 2024 Workshop MHFAIA Oral, CC BY 4.0
Keywords: Scalable oversight, preference learning, reinforcement learning from human feedback, unreliable human supervision, bounded rationality
TL;DR: Accounting for how reliable an annotator’s feedback is expected to be during preference learning can lead to more robust reward models that place more weight on important features, such as factual correctness.
Abstract: Reward functions learned from human feedback serve as the training objective for RLHF, the current state-of-the-art approach for aligning large language models to our values; however, in practice, these reward models fail to robustly capture our desiderata. For instance, they often place more weight on the length of the output or agreement with the user and less on important features like factual correctness. A major reason for these shortcomings is that the human annotator feedback on which the models are trained is unreliable. Due to knowledge gaps, limited resources, cognitive biases, or other factors, annotators may not be able to accurately judge the model's outputs, and thus their feedback may not be reliably aligned with their true preferences. Current proposals to address the challenges posed by unreliable feedback include asking annotators only questions they can answer easily, providing them with an AI assistant during evaluation, and relying primarily on AI feedback with limited human supervision (e.g., constitutional AI). However, it remains unclear how practical and scalable these approaches are. We identify a complementary strategy that can easily be incorporated into existing alignment methods (e.g., RLHF, DPO): explicitly modeling the annotators' knowledge and judgment in order to better learn from unreliable feedback. In particular, we propose an adjustment to the Bradley-Terry model used in preference learning that accounts for how well an annotator's feedback is expected to match their true values or preferences. We test our approach in a setting where annotators are likely to provide unreliable feedback, and we find that it results in preference models that assign higher value to important characteristics, such as factuality, than existing methods.
Submission Number: 79
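
To illustrate the kind of adjustment the abstract describes, the sketch below mixes the standard Bradley-Terry preference probability with a uniform label distribution according to a per-annotator reliability estimate, so that unreliable labels contribute less to the learned reward model. The function name, the `reliability` parameter, and the specific mixture form are illustrative assumptions and not necessarily the paper's exact formulation.

```python
import torch

def reliability_adjusted_bt_loss(reward_chosen, reward_rejected, reliability):
    """Negative log-likelihood of preference labels under a Bradley-Terry model
    adjusted for annotator reliability (illustrative sketch).

    reward_chosen, reward_rejected: shape (batch,) reward-model scores for the
        labeled-preferred and labeled-dispreferred responses.
    reliability: shape (batch,) values in [0, 1], the estimated probability that
        the annotator's label reflects their true preference; reliability = 1
        recovers the standard Bradley-Terry objective, reliability = 0 makes the
        label uninformative.
    """
    # Standard Bradley-Terry probability that the chosen response is preferred.
    p_bt = torch.sigmoid(reward_chosen - reward_rejected)
    # Mix with a uniform (coin-flip) label distribution according to reliability.
    p_label = reliability * p_bt + (1.0 - reliability) * 0.5
    return -torch.log(p_label).mean()

# Toy usage with hypothetical reward scores and per-example reliability estimates.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.9, -0.5])
annotator_reliability = torch.tensor([0.9, 0.5, 0.7])
loss = reliability_adjusted_bt_loss(r_chosen, r_rejected, annotator_reliability)
```

Under this mixture form, examples with low estimated reliability receive a flatter likelihood and therefore exert less influence on the reward model's parameters.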