Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Preference optimization, learning from feedback, language model alignment
Abstract: Direct alignment methods typically train large language models (LLMs) by contrasting the likelihoods of preferred and dispreferred responses. While effective for matching relative preferences, these methods have been widely observed to depress the absolute likelihoods of example responses. Consequently, aligned models often exhibit behaviors that deviate from expected patterns, resembling the well‑known reward‑hacking effect even in the absence of an explicit reward model. This phenomenon exposes a fundamental limitation of contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO)—the seminal direct alignment method—and show that its loss admits a principled decomposition. The resulting reformulation not only extends naturally to a broader range of feedback types, but also sheds light on the origin of likelihood underdetermination. In particular, we identify that the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and restoring its full version effectively resolves the underdetermination. Building on these insights, we introduce PRoximalized PReference Optimization (PRO), a unified alignment method that handles diverse feedback types while eliminating likelihood underdetermination through an efficient approximation of the full regularizer. Empirical evaluations demonstrate the consistent superiority of PRO over existing methods across pairwise, binary and scalar feedback.
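For context, the contrastive objective the abstract refers to is the standard DPO loss, reproduced below as a background sketch only (the paper's decomposed reformulation and its full regularizer appear in the full text, not here). Notation follows the usual DPO convention: \(\pi_\theta\) is the policy being aligned, \(\pi_{\mathrm{ref}}\) the reference model, \(\beta\) a temperature-like hyperparameter, and \((x, y_w, y_l)\) a prompt with its preferred and dispreferred responses drawn from the preference dataset \(\mathcal{D}\).

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
\]

Because only the difference of the two log-ratio terms enters the loss, any joint shift that lowers both responses' likelihoods while preserving their gap leaves the objective unchanged, which is consistent with the likelihood underdetermination described in the abstract.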
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 20535