Keywords: preference learning, reward
Abstract: Direct Alignment Algorithms (DAAs) such as DPO have become a common way to post-train and align LLMs with human preferences. However, DAAs have been observed to over-optimize their implicit reward model and decrease the likelihood of preferred responses. We provide evidence for the hypothesis that this over-optimization stems in part from a mismatch between the partition function estimate of the learned model and that of the optimal model. In particular, transformers return a normalized distribution over tokens and therefore have a partition function of one, suggesting that the true partition function should remain fixed throughout training. Existing DAAs do not account for this, as their objectives include no terms that optimize the partition function. To counteract this undesired side effect of DAAs, we examine objectives that add a regularization term to maintain the total length-normalized probabilities of the chosen and rejected responses. To better understand over-optimization, we investigate how response likelihood changes are distributed over the tokens with and without regularization. We find that a significant portion of the likelihood change is due to a small set of outlier tokens, which explains how DAAs improve generation quality despite decreasing the likelihoods of chosen responses.
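For context on the partition-function argument, the following sketch of the standard DPO reparameterization (using the usual notation $\pi_\theta$, $\pi_{\mathrm{ref}}$, $\beta$, $Z(x)$ from the DPO literature, not the submission's own notation) shows where $Z(x)$ appears and why it cancels from the pairwise objective, leaving it unconstrained during training:

```latex
% Closed-form optimum of the KL-constrained reward objective and its partition function:
\begin{align*}
  \pi_r(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big),
  \qquad
  Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big), \\
% Inverting for the reward introduces a beta*log Z(x) term:
  r(x, y) &= \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x), \\
% In the Bradley--Terry pairwise loss the beta*log Z(x) terms cancel,
% so the objective places no constraint on Z(x):
  \mathcal{L}_{\mathrm{DPO}} &= -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[
      \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Big)\right].
\end{align*}
```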
We apply the proposed regularization to reference-based (DPO) and reference-free (SimPO) methods and find (1) improved trade-offs between generation quality and general benchmark capability and (2) improvements in reward modeling across datasets. For example, on Llama-3.1-8B-Instruct, we see both a >20% increase in AlpacaEval2 scores and >9% performance gains on general benchmarks. Additionally, we find that the added regularization term effectively mitigates displacement within preferred responses, both overall and for the outlier tokens specifically, by utilizing low-likelihood tokens.
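A minimal sketch of how such a regularizer could be attached to a DPO-style loss, assuming summed response log-probabilities under the policy and reference models are available; the function name, the squared penalty on the total length-normalized probability, and the reg_weight hyperparameter are illustrative assumptions, not the submission's actual implementation:

```python
import torch
import torch.nn.functional as F

def dpo_with_partition_reg(policy_chosen_logps, policy_rejected_logps,
                           ref_chosen_logps, ref_rejected_logps,
                           chosen_lens, rejected_lens,
                           beta=0.1, reg_weight=1.0):
    """Hypothetical sketch: standard DPO loss plus a regularizer that keeps the
    total length-normalized probability of the chosen and rejected responses
    near its value under the reference model.

    All *_logps are summed token log-probabilities of each full response;
    *_lens are the response lengths in tokens.
    """
    # Standard reference-based DPO objective.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo_loss = -F.logsigmoid(logits).mean()

    # Length-normalized probabilities: exp of the mean per-token log-probability.
    policy_total = (torch.exp(policy_chosen_logps / chosen_lens)
                    + torch.exp(policy_rejected_logps / rejected_lens))
    ref_total = (torch.exp(ref_chosen_logps / chosen_lens)
                 + torch.exp(ref_rejected_logps / rejected_lens))

    # Assumed form of the regularizer: penalize drift of the total
    # length-normalized probability away from its reference value.
    reg = ((policy_total - ref_total.detach()) ** 2).mean()

    return dpo_loss + reg_weight * reg
```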
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12779