Keywords: Diffusion models, Text-to-image generation, Test-time alignment, Preference alignment
Abstract: Pre-trained diffusion models demonstrate remarkable performance in text-to-image generation, and current research efforts are directed toward aligning them with human preferences across diverse application scenarios. Existing approaches often rely on costly pipelines that require collecting preference data, training reward models, and fine-tuning. A promising alternative is test-time alignment, which steers diffusion models during sampling without retraining. However, current test-time alignment methods typically depend on explicit reward models to provide a guidance signal for modifying the sampling path; this requires decoding noisy intermediate images and estimating their rewards, which adds computational overhead and can limit flexibility across diverse scenarios. We propose Contrastive Gradient Guidance (CGG), a conceptually straightforward and practical framework for test-time alignment that avoids explicit reward models by design. CGG derives its guidance signal from the contrastive difference between two diffusion models, parameterized as the gradient of the log-likelihood ratio between the favored and unfavored distributions. This signal steers a pre-trained diffusion model along its sampling path while implicitly aligning generation with human preferences. Experiments demonstrate that CGG consistently improves preference alignment in text-to-image generation and flexibly adapts to safety-critical and multi-preference scenarios. Moreover, CGG can be combined with prevailing test-time alignment techniques to yield additional gains. These results establish CGG as a principled framework for advancing test-time alignment of diffusion models.
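One plausible reading of the construction described in the abstract, written as a sketch rather than the paper's exact formulation (the symbols $p_w$, $p_l$, $s_\theta$, and $\lambda$ are illustrative, not taken from the paper): if $p_w$ and $p_l$ denote the favored and unfavored distributions, each approximated by a diffusion model's score, the guidance signal at a noisy sample $x_t$ would be the gradient of their log-likelihood ratio,

$g(x_t) \;=\; \nabla_{x_t} \log \frac{p_w(x_t)}{p_l(x_t)} \;=\; \nabla_{x_t} \log p_w(x_t) \;-\; \nabla_{x_t} \log p_l(x_t),$

which could then be added to the pre-trained model's score during sampling, e.g. $s_\theta(x_t) + \lambda\, g(x_t)$ with a guidance scale $\lambda$, so that generation is steered toward the favored distribution without ever evaluating an explicit reward model.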
Primary Area: generative models
Submission Number: 22280