TL;DR: We show that the BT loss in reward modeling produces spurious learning signals driven by representation distance, and we propose a normalization that rescales updates to focus on prediction error.
Abstract: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of chosen and rejected responses. In this work, we analyze the per-sample gradient of the BT loss and show that it carries spurious learning signals caused by representation distance. In particular, the BT gradient norm scales with two distinct components: (1) **prediction error**, reflected by the difference in predicted rewards between chosen and rejected responses, and critically, (2) **representation distance** between the pair, measured in the output space of the final layer. While the first term captures the intended training signal, the second term can significantly affect the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. As a result, gradients from large-distance pairs overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that rescales updates to balance representation-driven effects and focus the learning signal on prediction error. NormBT is a lightweight, drop-in modification to the BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT consistently improves reward model performance, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous fine-grained pairs.
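The two-factor structure described in the abstract can be checked numerically in a simplified setting. The sketch below assumes a linear reward head r = w · h on top of final-layer features h and looks only at the gradient with respect to w; the function names and the random features are hypothetical, and the last step is an illustrative rescaling, not necessarily the exact NormBT rule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_head_gradient(w, h_chosen, h_rejected):
    """Gradient of -log(sigmoid(r_c - r_r)) w.r.t. w, assuming r = w @ h."""
    diff = h_chosen - h_rejected
    error = 1.0 - sigmoid(w @ diff)  # prediction-error factor, in (0, 1)
    return -error * diff

def gradient_factors(w, h_chosen, h_rejected):
    """Return (prediction error, representation distance) for one pair."""
    diff = h_chosen - h_rejected
    error = 1.0 - sigmoid(w @ diff)
    distance = np.linalg.norm(diff)
    return error, distance

# Hypothetical features: the "near" pair differs along the same direction
# as the "far" pair, but is 100x closer in representation space.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
h_c = rng.normal(size=8)
h_r_far = h_c + rng.normal(size=8)
h_r_near = h_c + 0.01 * (h_r_far - h_c)

for name, h_r in [("far", h_r_far), ("near", h_r_near)]:
    grad = bt_head_gradient(w, h_c, h_r)
    error, distance = gradient_factors(w, h_c, h_r)
    # The gradient norm factorizes as error * distance.
    print(name, np.linalg.norm(grad), error * distance)

# Illustrative rescaling (an assumption, not the paper's stated rule):
# dividing by the distance leaves a gradient whose norm is the error alone.
error_near, distance_near = gradient_factors(w, h_c, h_r_near)
rescaled = bt_head_gradient(w, h_c, h_r_near) / distance_near
```

In this toy setting the head-gradient norm is exactly the product of the two factors, so a nearly tied pair with a tiny feature difference gets a small update no matter how uncertain the model is about its ranking.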
Lay Summary: Reward models help LLMs to learn human preferences. They are usually trained by comparing a preferred response with a rejected one. Ideally, the model should learn most from pairs it ranks incorrectly.
However, we find that the standard training method can give too much weight to preference pairs that look very different to the model, while giving too little weight to subtle pairs that require fine-grained judgment. This makes the model focus on easy distinctions and underlearn harder ones, especially in reasoning tasks.
We propose NormBT, a simple training modification that balances the learning signal across response pairs. It helps the model focus more on whether its preference judgment is correct, rather than on how internally different the two responses are. NormBT is easy to add to existing reward-model training and improves performance across several models and datasets, with especially strong gains on reasoning evaluations.
Link To Code: https://github.com/txie1/NormBT
Primary Area: Deep Learning->Large Language Models
Keywords: LLM Alignment, RLHF, Reward Model, Bradley-Terry
Originally Submitted PDF: pdf
Submission Number: 18748