Keywords: RLHF, Pluralistic Alignment, Human Preferences
TL;DR: We examine examples with diverging preferences, instances where annotators disagree on which of two responses is preferred, in human-labeled preference datasets.
Abstract: We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes. We find that the majority of disagreements are at odds with standard reward modeling approaches, which are designed under the assumption that annotator disagreement is noise. We then explore how these findings impact reward modeling. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences.
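For reference, the standard Bradley-Terry reward-modeling objective referenced in the abstract can be sketched as follows (a conventional formulation, assumed here rather than quoted from the paper): given a prompt $x$ with chosen response $y_w$ and rejected response $y_l$, the reward model $r_\theta$ is trained to minimize

\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right].
\]

Because this objective consumes a single preference label per pair, a 6-to-4 split among annotators enters training the same way as a unanimous 10-to-0 judgment, which is the failure mode the abstract describes.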
Submission Number: 51