Reward Modeling Requires Automatic Adjustment Based on Data Quality

ACL ARR 2024 June Submission5877 Authors

16 Jun 2024 (modified: 02 Jul 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: In Reinforcement Learning from Human Feedback (RLHF), the reward model plays a crucial role in aligning language model outputs with human values. The human preference data used to train the reward model consists of a prompt and a response pair, with humans annotating which response better aligns with human value preferences. Due to the complexity and subjectivity of the annotation task, multiple organizations including OpenAI and Anthropic report significant noise in the human preference datasets, leading to instability and deviation in reward model training from human values. We discover that the difference in scores assigned to response pairs by the reward model effectively indicates the quality of data, and data of varying qualities show significant distinctions in reward model training. We introduce a method that automatically adjusts reward modeling based on data quality, reducing the impact of noise and making full use of dataset. Experiments on multiple human preference datasets demonstrate that our method stabilizes reward model training and significantly enhances the alignment performance of RLHF.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: RLHF;reward model
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 5877
Loading