Unpaired Preference Optimization: Aligning Visual Generative Models with Scalar Feedback

08 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion Models, Preference Optimization
Abstract: Direct Preference Optimization (DPO) offers a stable and simple alternative to reinforcement learning for aligning large generative models, but its reliance on paired preference comparisons is a critical limitation. In practice, feedback often arrives as unpaired scalar scores, such as human ratings, which cannot be directly used by DPO. To resolve this, we first revisit the KL-regularized alignment objective and show that for individual samples, the optimal policy is governed by an elegant but intractable decision rule: comparing a sample's reward against an instance-dependent oracle baseline. Building on this insight, we introduce Unpaired Preference Optimization (UPO), a new framework that provides a principled and tractable proxy for this ideal rule. UPO approximates the oracle baseline with a dynamic threshold derived from the empirical score distribution, thereby reframing alignment as a simple classification task on unpaired data. This core mechanism is further enhanced by a confidence-weighting scheme that leverages the full magnitude of the scores. Extensive experiments demonstrate that UPO effectively aligns diverse generative models, including both diffusion and MaskGIT paradigms, significantly outperforming standard fine-tuning baselines. By extending the simplicity of DPO to the more practical setting of unpaired scalar feedback, UPO provides a principled and scalable path for aligning generative models with human preference signals.
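The abstract describes the mechanism in prose only; below is a minimal, speculative PyTorch sketch of what an unpaired, threshold-based objective of this kind could look like. It assumes a DPO-style implicit reward beta * (log pi_theta - log pi_ref), a batch-quantile threshold as the dynamic proxy for the oracle baseline, and a sigmoid confidence weight on the score-threshold gap. All names and hyperparameters (`upo_loss`, `tau_quantile`, `temperature`, `beta`) are hypothetical and not taken from the submission, which may define the loss differently.

```python
# Speculative sketch of an unpaired, threshold-based preference loss in the
# spirit of the abstract. The exact threshold rule and confidence weighting
# used by the paper are unknown; these are illustrative assumptions.
import torch
import torch.nn.functional as F


def upo_loss(
    logp_policy: torch.Tensor,   # per-sample log-prob (or ELBO surrogate) under the policy
    logp_ref: torch.Tensor,      # same quantity under the frozen reference model
    scores: torch.Tensor,        # unpaired scalar feedback, e.g. human ratings
    beta: float = 0.1,           # KL-regularization strength, as in DPO
    tau_quantile: float = 0.5,   # quantile of the empirical score distribution used as threshold
    temperature: float = 1.0,    # scale for the confidence weights
) -> torch.Tensor:
    # Dynamic threshold: a quantile of the batch's empirical score distribution
    # stands in for the intractable instance-dependent oracle baseline.
    tau = torch.quantile(scores, tau_quantile)

    # Unpaired data becomes a classification problem: samples scoring above the
    # threshold are treated as "preferred" (label 1), the rest as "dispreferred".
    labels = (scores > tau).float()

    # DPO-style implicit reward: beta * log(pi_theta / pi_ref).
    implicit_reward = beta * (logp_policy - logp_ref)

    # Confidence weighting: samples far from the threshold get larger weight,
    # so the full magnitude of the scalar scores is used, not just the sign.
    weights = torch.sigmoid((scores - tau).abs() / temperature)

    # Per-sample binary cross-entropy between the implicit reward and the label.
    per_sample = F.binary_cross_entropy_with_logits(
        implicit_reward, labels, reduction="none"
    )
    return (weights * per_sample).mean()


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real model outputs.
    n = 8
    loss = upo_loss(
        logp_policy=torch.randn(n, requires_grad=True),
        logp_ref=torch.randn(n),
        scores=torch.rand(n) * 10.0,  # e.g. ratings on a 0-10 scale
    )
    loss.backward()
    print(float(loss))
```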
Primary Area: reinforcement learning
Submission Number: 2998