Keywords: Diffusion Models, Preference Optimization
Abstract: Direct Preference Optimization (DPO) offers a stable and simple alternative to reinforcement learning for aligning large generative models, but its reliance on paired preference comparisons is a critical limitation. In practice, feedback often arrives as unpaired scalar scores, such as human ratings, which cannot be directly used by DPO. To resolve this, we first revisit the KL-regularized alignment objective and show that for individual samples, the optimal policy is governed by an elegant but intractable decision rule: comparing a sample's reward against an instance-dependent oracle baseline.
Building on this insight, we introduce Unpaired Preference Optimization (UPO), a new framework that provides a principled and tractable proxy for this ideal rule. UPO approximates the oracle baseline with a dynamic threshold derived from the empirical score distribution, thereby reframing alignment as a simple classification task on unpaired data. This core mechanism is further enhanced by a confidence-weighting scheme that leverages the full magnitude of the scores. Extensive experiments demonstrate that UPO effectively aligns diverse generative models, including both diffusion and MaskGIT paradigms, significantly outperforming standard fine-tuning baselines. By extending the simplicity of DPO to the more practical setting of unpaired scalar feedback, UPO provides a principled and scalable path for aligning generative models with human preference signals.
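To make the described mechanism concrete, the following is a minimal sketch of a threshold-based, confidence-weighted objective on unpaired scores. It is not the paper's exact formulation: the batch-quantile threshold, the sigmoid confidence weights, and the hyperparameters `quantile`, `beta`, and `tau` are all illustrative assumptions, and the DPO-style implicit reward `beta * log(pi/pi_ref)` is borrowed from the standard KL-regularized setup rather than taken from the submission.

```python
import torch
import torch.nn.functional as F

def upo_loss_sketch(logp_policy, logp_ref, rewards, quantile=0.5, beta=0.1, tau=1.0):
    """Hypothetical sketch of an unpaired, threshold-based preference loss.

    logp_policy, logp_ref: per-sample log-likelihoods under the fine-tuned
        model and a frozen reference model (shape [B]).
    rewards: unpaired scalar scores for the same samples (shape [B]).
    quantile, beta, tau: illustrative hyperparameters (assumptions, not
        values from the paper).
    """
    # Dynamic threshold: a batch quantile of the empirical score distribution
    # stands in for the intractable per-instance oracle baseline.
    threshold = torch.quantile(rewards, quantile)

    # Reframe alignment as binary classification on unpaired samples:
    # label 1 if the sample's score beats the threshold, 0 otherwise.
    labels = (rewards > threshold).float()

    # Confidence weights use the magnitude of the score margin, so samples
    # far from the threshold contribute more to the loss.
    weights = torch.sigmoid((rewards - threshold).abs() / tau)

    # KL-regularized implicit reward, as in DPO: beta * log(pi / pi_ref).
    implicit_reward = beta * (logp_policy - logp_ref)

    # Weighted binary cross-entropy between the implicit reward (as a logit)
    # and the threshold-derived labels.
    return F.binary_cross_entropy_with_logits(
        implicit_reward, labels, weight=weights
    )
```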
Primary Area: reinforcement learning
Submission Number: 2998