Diffusion Alignment as Variational Expectation-Maximization

ICLR 2026 Conference Submission10393 Authors

18 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Diffusion Model, Alignment, RLHF, Test-time Search
TL;DR: Diffusion Alignment as Variational Expectation-Maximization (DAV) alternates test-time search (E-step) and forward-KL distillation (M-step) to align continuous and discrete diffusion models.
Abstract: Diffusion alignment aims to optimize diffusion models for downstream objectives. While existing methods based on reinforcement learning or direct backpropagation achieve considerable success in maximizing rewards, they often suffer from reward over-optimization and mode collapse. We introduce Diffusion Alignment as Variational Expectation-Maximization (DAV), a framework that formulates diffusion alignment as an iterative process alternating between two complementary phases: the E-step and the M-step. In the E-step, we employ test-time search to generate diverse and reward-aligned samples. In the M-step, we refine the diffusion model using the samples discovered by the E-step. We demonstrate that DAV can optimize reward while preserving diversity on both continuous and discrete tasks: text-to-image synthesis and DNA sequence design.
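
The abstract and TL;DR only outline the E-step/M-step alternation; the sketch below is an illustrative reading of it, not the authors' implementation. It assumes a one-dimensional Gaussian as a stand-in for the diffusion model, reward-weighted importance resampling as a stand-in for test-time search, and moment matching as the forward-KL distillation step (for a Gaussian family, minimizing the forward KL reduces exactly to moment matching). The `reward` function and all constants are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x):
    # Hypothetical two-mode reward, standing in for a learned reward model
    # over images or DNA sequences.
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

# Toy "diffusion model": a Gaussian with learnable mean/std in place of a
# full denoising network.
mu, sigma = 0.0, 3.0

for em_round in range(10):
    # E-step: test-time search. Draw many candidates from the current model
    # and importance-resample them by reward, producing reward-aligned
    # samples that still cover both modes.
    candidates = rng.normal(mu, sigma, size=4096)
    w = reward(candidates)
    w = w / w.sum()
    elites = rng.choice(candidates, size=1024, p=w)

    # M-step: forward-KL distillation. For a Gaussian model, minimizing
    # KL(target || model) is moment matching on the E-step samples.
    mu = elites.mean()
    sigma = elites.std()

print(f"final model: mu={mu:.2f}, sigma={sigma:.2f}")
# Expect mu ~ 0 and sigma ~ 1.7: forward KL is mode-covering, so the model
# spreads over both reward modes instead of collapsing onto one.
```

The mode-covering behavior of the forward KL is the point of the toy: a reverse-KL or pure reward-maximization update would collapse onto a single mode, which is the over-optimization failure the abstract describes.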
Primary Area: generative models
Submission Number: 10393