TL;DR: We analyze the Bradley-Terry model as a source of reward model over-optimization and propose batch-wise sum-to-zero regularization (BSR) to improve robustness to unseen inputs.
Abstract: The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning from human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT model loss as one-way classifiers are prone to over-optimization, losing generalizability to unseen inputs. In this paper, we study the cause of over-optimization and its downstream effects on the RLHF procedure, highlighting the importance of robustness in RMs. First, we show that excessive dispersion of hidden-state norms is the main source of over-optimization. Correspondingly, we propose batch-wise sum-to-zero regularization (BSR), which enforces the reward sum of each batch to be zero-centered, constraining rewards with abnormally large magnitudes. We assess the impact of BSR on RM robustness across four over-optimization scenarios, where BSR consistently yields better robustness on unseen inputs. We then compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5\% on complex preference prediction tasks. Conducting RLOO training with the 8B RMs reduces generation length on AlpacaEval 2.0 by 40\% while adding a 7\% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training.
Lay Summary: Reward models (RMs) are crucial tools that help large language models (LLMs) align with human preferences by assigning scores to their outputs. Typically, these models learn by comparing pairs of responses and preferring one over the other, so that their scores better match human choices. However, these models can overfit to their training data, becoming less effective when encountering new or slightly different responses.
This paper identifies a major cause of this problem: RMs trained in this way tend to produce uneven scores due to large variations in the internal features (hidden states) they use. When these variations become too large, models become overly confident in their scores, causing them to lose their ability to generalize. To address this, the authors propose a straightforward solution, batch-wise sum-to-zero regularization (BSR), that encourages the scores within each training batch to be balanced around zero. This prevents extreme scoring, stabilizing the model’s internal representations and significantly improving its ability to handle new data.
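The abstract and summary describe BSR only at a high level. Below is a minimal PyTorch sketch of how a batch-wise sum-to-zero penalty could be combined with the standard Bradley-Terry pairwise loss; the squared-mean penalty form, the coefficient name `bsr_coef`, and the function name are illustrative assumptions rather than the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def bt_loss_with_bsr(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor,
                     bsr_coef: float = 0.01) -> torch.Tensor:
    """Bradley-Terry loss plus a batch-wise sum-to-zero penalty (sketch).

    `chosen_rewards` / `rejected_rewards`: shape (batch_size,), scalar rewards
    from the RM head for the preferred and dispreferred responses.
    """
    # Standard Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)
    bt = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Batch-wise sum-to-zero regularization: penalize the squared mean of all
    # rewards in the batch so that scores stay centered around zero,
    # discouraging rewards with abnormally large magnitudes.
    all_rewards = torch.cat([chosen_rewards, rejected_rewards])
    bsr = all_rewards.mean().pow(2)

    return bt + bsr_coef * bsr
```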
The paper tests this solution rigorously across various scenarios, demonstrating that RMs trained with this regularization consistently outperform traditional approaches. Moreover, when these improved RMs are used for reinforcement learning in language models (i.e., RLHF), they lead to more aligned, less verbose, and more robust language outputs. Overall, the paper provides a systematic analysis of how RM over-optimization propagates from reward modeling to RLHF training, highlighting the importance of improving the robustness of RMs in the reward modeling phase.
Link To Code: https://github.com/LinkedIn-XFACT/RM-Robustness
Primary Area: Deep Learning->Large Language Models
Keywords: RLHF, Reward modeling, Over-optimization
Submission Number: 13241