TL;DR: We examine diverging preferences in human-labeled preference datasets and their influence on reward modeling and LLM evaluation.
Abstract: We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes and find that the majority of disagreements arise from factors such as task underspecification or response style. Our findings challenge a standard assumption in reward modeling methods that annotator disagreements can be attributed to simple noise. We then explore how these findings impact two areas of LLM development: reward model training and evaluation. In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators. These findings highlight challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences and mitigating their influence in evaluations and during LLM training.
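As a minimal illustration of the assumption the abstract challenges, the sketch below shows a standard Bradley-Terry pairwise reward-modeling loss (hypothetical PyTorch code, not the authors' implementation): every annotated pair is treated as a single decisive preference, so diverging annotator judgments on the same pair are effectively averaged away as noise.

```python
# Illustrative sketch only: standard Bradley-Terry reward-modeling loss.
# It rewards widening the score gap between "chosen" and "rejected" responses,
# incentivizing a decisive preference even when annotators genuinely disagree.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response beats the rejected one.

    r_chosen, r_rejected: scalar reward-model scores, shape (batch,).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example with hypothetical reward scores for two response pairs.
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.7, 0.9])
print(bradley_terry_loss(r_chosen, r_rejected))  # ~0.76
```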
Lay Summary: We explore how users disagree about which LLM responses they prefer. We analyze what factors lead to disagreement, finding that the majority of disagreements stem from factors such as task underspecification or response style. We then examine how disagreements are handled in existing LLM training and evaluation methods, finding that standard methods incentivize LLMs to decisively prefer one response even when users disagree. Finally, we propose methods for mitigating these behaviors in LLM training and evaluation.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: RLHF, Pluralistic Alignment
Submission Number: 13232