DDSPO: Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

18 Sept 2025 (modified: 29 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Text-to-Image Alignment, Diffusion Models, Self-Training, Preference Optimization
TL;DR: We improve text-image alignment in diffusion models by self-generating pseudo-preference pairs and applying Diffusion DPO, without human labels. We also propose a score-based variant for more data-efficient training.
Abstract: Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which—when winning/losing policies are accessible—directly derives per-timestep supervision from these policies. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision.
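The abstract does not spell out DDSPO's exact objective, so the following is only a minimal sketch of the general idea under stated assumptions: a Diffusion-DPO-style logistic loss applied per denoising timestep, where the "winning" condition is the original prompt and the "losing" condition is a semantically degraded variant produced from it. All names here (ddspo_step_loss, degrade_prompt, eps_theta, eps_ref, beta) are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: per-timestep preference supervision in noise-prediction
# space, using a frozen reference model and prompt degradation in place of human labels.
import torch
import torch.nn.functional as F


def degrade_prompt(prompt: str) -> str:
    """Hypothetical semantic degradation, e.g. dropping descriptive tokens."""
    words = prompt.split()
    return " ".join(words[: max(1, len(words) // 2)])  # crude truncation


def ddspo_step_loss(eps_theta, eps_ref, x_t, t, emb_win, emb_lose, target_eps, beta=0.1):
    """
    DPO-style contrastive loss at a single denoising step.

    eps_theta : trainable denoiser, called as eps_theta(x_t, t, cond) -> predicted noise
    eps_ref   : frozen reference denoiser with the same call signature
    emb_win   : text embedding of the original prompt (winning condition)
    emb_lose  : text embedding of the degraded prompt (losing condition)
    target_eps: the noise actually added to form x_t
    """
    # Per-sample denoising errors of the trainable model under both conditions.
    err_w = F.mse_loss(eps_theta(x_t, t, emb_win), target_eps, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(eps_theta(x_t, t, emb_lose), target_eps, reduction="none").mean(dim=(1, 2, 3))

    # Same quantities for the frozen reference model (no gradients).
    with torch.no_grad():
        ref_w = F.mse_loss(eps_ref(x_t, t, emb_win), target_eps, reduction="none").mean(dim=(1, 2, 3))
        ref_l = F.mse_loss(eps_ref(x_t, t, emb_lose), target_eps, reduction="none").mean(dim=(1, 2, 3))

    # Logistic preference loss: fit the winning condition better than the losing one,
    # relative to the reference model, at this timestep.
    logits = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(logits).mean()
```

Summed or averaged over sampled timesteps, a loss of this shape gives the dense, transition-level signal the abstract describes without any reward model or human annotation; the authors' actual formulation may differ in how the score-space comparison is defined.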
Primary Area: generative models
Submission Number: 12049