Keywords: RLHF, Pluralistic Alignment, Human Preferences
TL;DR: We examine examples with diverging preferences, instances where annotators disagree on which of two responses is preferred, in human-labeled preference datasets.
Abstract: We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes. We find that the majority of disagreements are at odds with standard reward modeling approaches, which are designed under the assumption that annotator disagreement is noise. We then explore how these findings impact reward modeling. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences.
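For reference, the standard Bradley-Terry reward-modeling objective referenced in the abstract can be sketched as follows (a conventional formulation, assumed here rather than quoted from the paper): given a prompt $x$ with chosen response $y_w$ and rejected response $y_l$, the reward model $r_\theta$ is trained to minimize

\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right].
\]

Because this objective consumes a single preference label per pair, a 6-to-4 split among annotators enters training the same way as a unanimous 10-to-0 judgment, which is the failure mode the abstract describes.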
Submission Number: 51