Inductive Moment Matching

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY 4.0
Abstract: Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Inductive Moment Matching (IMM), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, IMM does not require pre-training initialization or the joint optimization of two networks; and unlike Consistency Models, IMM guarantees distribution-level convergence and remains stable under a wide range of hyperparameters and standard model architectures. IMM surpasses diffusion models on ImageNet-256x256, reaching 2.13 FID with only 8 inference steps, and achieves a state-of-the-art 2-step FID of 2.05 on CIFAR-10 for a model trained from scratch.
Lay Summary: Diffusion models and Flow Matching are slow at sampling: even when perfectly trained, they require tens to hundreds of steps to produce high-quality samples. Recent approaches distill these slow samplers into fast one-step or few-step ones as a post-training stage, but such two-stage approaches often require extensive tuning, e.g., balancing the training of two networks or carefully tuned training schedules. We investigate a single-stage approach, without adversarial losses, that directly achieves few-step sampling at inference, and we surpass diffusion models on both quality and sampling efficiency on standard benchmarks. We term our approach Inductive Moment Matching (IMM). It learns using Maximum Mean Discrepancy (MMD), a stable divergence metric that matches two probability distributions from samples. In addition, we incorporate a learning strategy, inspired by mathematical induction, that allows the model to learn from its own samples. We theoretically prove that IMM is guaranteed to converge to the data distribution, and it is empirically more stable than other few-step approaches such as Consistency Training while achieving better performance than diffusion models on ImageNet-256x256 with only 8 steps. Our work marks a step toward few-step models trained from scratch, opening up possibilities for high-quality synthesis on higher-dimensional data and real-time generation without complex multi-stage training strategies.
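For readers unfamiliar with MMD, the following is a minimal sketch (not taken from the paper or its codebase) of the squared-MMD estimator the summary refers to, using an RBF kernel; the kernel choice, bandwidth, and biased V-statistic form are illustrative assumptions, not the paper's exact training objective.

```python
import torch

def rbf_kernel(x, y, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)); illustrative kernel choice
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between sample sets x and y.

    MMD^2(P, Q) = E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')],
    estimated here by averaging kernel evaluations over the two batches.
    """
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx - 2 * k_xy + k_yy

# Usage: MMD^2 is near zero when the two sample sets come from the same
# distribution and grows as the distributions separate.
x = torch.randn(256, 2)            # samples from P
y = torch.randn(256, 2) + 0.5      # samples from a shifted Q
print(mmd_squared(x, y).item())         # noticeably > 0: distributions differ
print(mmd_squared(x, x.clone()).item()) # exactly 0: identical samples
```

Because the estimate is computed purely from samples, it can serve as a training signal without a second (e.g., adversarial) network, which is the stability property the summary highlights.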
Link To Code: https://github.com/lumalabs/imm
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: generative models, diffusion models, flow matching, moment matching, consistency models
Submission Number: 3490