Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY-NC-ND 4.0
TL;DR: a better alternative to the Bradley-Terry preference model
Abstract: Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model (BTM): it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based preference model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To showcase the practical utility of replacing the BTM with our EBM in the context of offline alignment, we adapt a simple yet scalable objective function from the recent literature on fitting EBMs and name it Energy Preference Alignment (EPA). Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks than DPO, thereby validating the theoretical superiority of our EBM.
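For context, the standard formulation referenced above (a recap following the original DPO paper's notation, not text from this submission) is: the Bradley-Terry model scores a pairwise preference through a reward difference, and DPO minimizes the induced negative log-likelihood with the reward implicitly parameterized by the target policy.

\[
P_{\mathrm{BT}}(y_w \succ y_l \mid x) = \sigma\!\big(r(x, y_w) - r(x, y_l)\big),
\qquad
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Big[ \log \sigma\!\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \Big].
\]

The abstract's observation is that minimizing this loss does not, by itself, pin down the 1:1 linear relationship between the implicit reward \(r_\theta\) and the true reward.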
Lay Summary: The Bradley-Terry model (BTM) has served as the default preference model for RLHF and DPO applications. While this model performs adequately for the general purpose of training reward models, we identify a theoretical flaw when applying it to train implicit reward models in DPO: the DPO loss function, formulated as the maximum likelihood estimation of the BTM, admits multiple minimizers. Building on this theoretical finding, we question the established equivalence between the maximum likelihood estimation of the BTM (when parameterized with the target policy) and the optimization of the RLHF objective. To address this limitation, we propose an alternative approach that circumvents this issue: the Infinite Preference Model, an energy-based model that can be trained using either the same pairwise datasets employed in DPO or datasets containing more than two responses per prompt.
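As a rough illustration of how an energy-based preference model can consume more than two responses per prompt (a generic softmax-of-rewards form; this sketch is ours and is not necessarily the exact EPA objective defined in the paper), the likelihood of the preferred response \(y^{+}\) among \(K\) candidates can be written as

\[
P_{\mathrm{EBM}}\big(y^{+} \mid x, \{y_1, \dots, y_K\}\big)
= \frac{\exp\!\big(r_\theta(x, y^{+})\big)}{\sum_{k=1}^{K} \exp\!\big(r_\theta(x, y_k)\big)},
\]

with \(r_\theta\) parameterized by the target policy as in DPO; setting \(K = 2\) recovers the pairwise setting.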
Primary Area: Deep Learning->Large Language Models
Keywords: preference model, offline alignment
Submission Number: 9454