Abstract: Reward functions learned from human feedback are the backbone of reinforcement learning from human feedback (RLHF), the current state-of-the-art approach for aligning large language models with our values. However, reward models (RMs) often fall short of capturing our true preferences, overemphasizing superficial features like length while undervaluing crucial aspects like factual accuracy. A major reason behind this failure is that standard preference learning largely ignores the inherent limitations of the human annotators providing preference data, including their cognitive biases, knowledge gaps, and resource constraints. To address this, we propose Reliability-Aware Preference Learning (RAPL), which explicitly accounts for varying annotator reliability. Specifically, RAPL modifies the standard preference learning loss function based on an estimate of how reliable annotator feedback will be for each preference comparison pair. We call these parameters annotator reliability metrics (ARMs) and demonstrate how to estimate them from annotator behavior indicators (e.g., self-reported confidence) or models specifically fine-tuned to predict annotator reliability. Extensive experiments reveal that RMs trained with standard preference learning inherit annotator biases. In contrast, RAPL effectively amplifies the signal from reliable judgments while attenuating less trustworthy feedback, leading to models that better align with annotators' true preferences.
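To make the loss modification concrete, below is a minimal sketch of a reliability-weighted preference loss, assuming RAPL scales each pairwise comparison's contribution by its estimated ARM. The Bradley-Terry form, the weighting scheme, and the names (`rapl_loss`, `arm`, `reward_chosen`, `reward_rejected`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def rapl_loss(reward_chosen: torch.Tensor,
              reward_rejected: torch.Tensor,
              arm: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood, weighted per pair by its ARM.

    reward_chosen / reward_rejected: RM scores for the preferred and
        dispreferred responses, shape (batch,).
    arm: estimated annotator reliability in [0, 1] for each comparison,
        e.g. derived from self-reported confidence; shape (batch,).
    """
    # Standard preference-learning term: -log sigmoid(r_chosen - r_rejected)
    per_pair_nll = -F.logsigmoid(reward_chosen - reward_rejected)
    # Up-weight reliable judgments, down-weight less trustworthy feedback.
    return (arm * per_pair_nll).mean()


# Example usage with dummy reward scores and reliability estimates:
# loss = rapl_loss(torch.randn(8), torch.randn(8), torch.rand(8))
```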
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We moved some results from the appendix into the main text (Tables 2 and 3) in response to reviewers. We also clarified our contributions at the end of the introduction.
Assigned Action Editor: ~Dong_Guo4
Submission Number: 6256