RMA: Reward Model Alignment with Human Preference

Published: 10 Jun 2025, Last Modified: 14 Jul 2025 · ICML 2025 World Models Workshop · CC BY 4.0
Keywords: Reward model, Supervised Training, Dataset Generation
Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences. These models are typically trained on datasets containing an input prompt, two model-generated responses, and a preference label indicating which response is preferred. However, current approaches often generalize poorly, performing inconsistently across contexts and exhibiting biases such as position bias (favoring the first response), verbosity bias (preferring longer outputs), and self-enhancement bias (favoring self-reinforcing statements). In this work, we propose Preference Prediction, a framework that combines high-quality preference data validated by human annotators with open-source data, together with a preference selector trained via supervised fine-tuning (SFT) to dynamically choose the most suitable model for a given context. Through comprehensive experiments on a variety of datasets, we show that the resulting Reward Model Alignment (RMA) approach not only surpasses existing reward models but also significantly improves the effectiveness of other reward models when applied to synthetic data. Additionally, RMA promotes the generation of more diverse, higher-quality responses by integrating multiple quality dimensions, such as helpfulness, relevance, and completeness, into the prompting process.
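For context, reward models of the kind described in the abstract are commonly trained with a pairwise Bradley-Terry objective over (chosen, rejected) response pairs. The sketch below is a minimal illustration of that standard formulation, not the paper's method; it assumes PyTorch and uses hypothetical toy reward scores in place of outputs from a real reward-model head.

    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(chosen_rewards: torch.Tensor,
                                 rejected_rewards: torch.Tensor) -> torch.Tensor:
        # Standard Bradley-Terry objective: maximize the log-probability that the
        # preferred (chosen) response receives a higher scalar reward than the
        # rejected one, i.e. minimize -log(sigmoid(r_chosen - r_rejected)).
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Toy usage with hypothetical scalar rewards; in practice these would come
    # from a reward-model head scoring the two responses to the same prompt.
    chosen = torch.tensor([1.7, 0.3, 2.1])
    rejected = torch.tensor([0.9, 0.5, 1.4])
    loss = pairwise_preference_loss(chosen, rejected)
    print(float(loss))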
Submission Number: 17