$\alpha$-DPO: Robust Preference Alignment for Diffusion Models via $\alpha$ Divergence

Published: 26 Jan 2026 · Last Modified: 11 Apr 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: diffusion model; preference alignment; noise robustness
Abstract: Diffusion models have demonstrated remarkable success in high-fidelity image generation, yet aligning them with human preferences remains challenging. Direct Preference Optimization (DPO) offers a promising framework, but its effectiveness is critically hindered by noisy data arising from mislabeled preference pairs and inconsistent individual preferences. We theoretically show that existing DPO objectives are equivalent to minimizing the forward Kullback–Leibler (KL) divergence, whose mass-covering nature makes them intrinsically sensitive to such noise. To address this limitation, we propose $\alpha$-DPO, which reformulates preference alignment through the lens of $\alpha$-divergence. This formulation promotes mode-seeking behavior and bounds the influence of outliers, thereby enhancing robustness. Furthermore, we introduce a dynamic scheduling mechanism that adaptively adjusts $\alpha$ according to the observed preference distribution, providing data-aware noise tolerance during training. Extensive experiments on synthetic and real-world datasets validate that $\alpha$-DPO consistently outperforms existing baselines, achieving superior robustness and preference alignment.
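To make the robustness intuition concrete, here is a minimal sketch of the contrast the abstract describes. The paper's exact objective is not given on this page, so `alpha_dpo_loss` below is a hypothetical $\alpha$-divergence-style surrogate (recovering the standard DPO logistic loss as $\alpha \to 0$), shown only to illustrate how such a loss saturates on outlier pairs; the function names and the $\alpha$ parameterization are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(margin: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # Standard DPO: logistic loss on the implicit reward margin
    # beta * (log-ratio of winner - log-ratio of loser). Its gradient
    # magnitude grows with how "wrong" a pair looks, so a mislabeled
    # pair with a large negative margin dominates the update.
    return -F.logsigmoid(beta * margin)

def alpha_dpo_loss(margin: torch.Tensor, beta: float = 1.0,
                   alpha: float = 0.5) -> torch.Tensor:
    # Hypothetical alpha-divergence-style surrogate: (1 - p^alpha) / alpha
    # with p = sigmoid(beta * margin). For alpha in (0, 1) the loss is
    # bounded above by 1/alpha, and its gradient, -beta * p^alpha * (1 - p),
    # vanishes as p -> 0, so outlier (likely mislabeled) pairs have
    # bounded influence. As alpha -> 0 it recovers -log p, i.e. dpo_loss.
    p = torch.sigmoid(beta * margin)
    return (1.0 - p.pow(alpha)) / alpha
```

A quick check on margins `[-10, 0, 10]` shows the difference: the standard DPO loss on the mislabeled-looking pair (margin $-10$) is about $10$, while the $\alpha=0.5$ surrogate stays below its bound of $1/\alpha = 2$, which is the "bounded influence of outliers" property the abstract claims.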
Primary Area: reinforcement learning
Submission Number: 6618