Keywords: large language model, reward model, bias mitigation
TL;DR: We propose CHARM, a calibration method that mitigates model preference bias in reward models by leveraging Elo scores from Chatbot Arena, improving fairness and alignment with human preferences.
Abstract: Reward models (RMs) play a crucial role in Reinforcement Learning from Human Feedback (RLHF) by serving as proxies for human preferences when aligning large language models. However, RMs suffer from various biases that can lead to reward hacking. In this paper, we identify a model preference bias in RMs, whereby they systematically assign disproportionately high scores to responses from certain policy models, leading to unfair judgments. To mitigate this bias, we propose a calibration method named **CH**atbot **A**rena calibrated **R**eward **M**odeling (**CHARM**), which leverages Elo scores from the Chatbot Arena to construct debiased preference datasets and adjust reward model scoring. We conduct extensive experiments on reward model benchmarks and human preference alignment. The results demonstrate that our calibrated RMs achieve improved evaluation accuracy on RM-Bench and the Chat-Hard domain of RewardBench, and exhibit a stronger correlation with human preferences by producing scores more closely aligned with Elo rankings. Beyond this, **CHARM** enhances robustness to stylistic variations, mitigates implicit pattern bias, and generalizes to unseen models. These findings show that **CHARM** provides a simple, effective, and broadly applicable approach to building more reliable and fair reward models.
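The abstract does not spell out how Elo scores enter the calibration; the snippet below is a minimal sketch of the general idea of aligning per-model reward scores with Arena Elo ratings. The Elo values, raw scores, linear Elo-to-reward mapping, and all function names are illustrative assumptions, not CHARM's actual procedure.

```python
from statistics import mean

# Hypothetical inputs: Chatbot Arena Elo ratings and raw RM scores collected
# for two policy models on a shared prompt set (values are made up).
ELO = {"model_a": 1250.0, "model_b": 1180.0}
RAW_SCORES = {"model_a": [2.9, 3.1, 2.7], "model_b": [2.0, 2.2, 1.9]}


def elo_implied_score(model: str, reference_elo: float = 1200.0, scale: float = 0.005) -> float:
    """Map a model's Elo rating onto the reward scale (assumed linear mapping)."""
    return scale * (ELO[model] - reference_elo)


def per_model_offset(model: str) -> float:
    """Excess of the RM's average score for a model over what its Elo justifies."""
    return mean(RAW_SCORES[model]) - elo_implied_score(model)


def calibrated_score(raw_reward: float, model: str) -> float:
    """Remove the per-model excess so calibrated scores track Elo rankings."""
    return raw_reward - per_model_offset(model)


if __name__ == "__main__":
    # After calibration, a model the RM over-scores relative to its Arena rating
    # no longer receives an unfairly inflated reward for the same raw score.
    print(calibrated_score(2.8, "model_a"), calibrated_score(2.8, "model_b"))
```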
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5208