Diffusion Dataset Condensation: Training Your Diffusion Model Faster with Less Data

Published: 13 Jul 2025, Last Modified: 25 Jul 2025 · arXiv · CC BY 4.0
Abstract: Diffusion models have achieved remarkable success in various generative tasks, but training them remains highly resource-intensive, often requiring millions of images and days of GPU computation. To address this limitation from a data-centric perspective, we study diffusion dataset condensation, a new and challenging problem setting that aims to construct a ``synthetic'' sub-dataset with significantly fewer samples than the original dataset for training high-quality diffusion models significantly faster. To the best of our knowledge, we are the first to formally study the dataset condensation task for diffusion models, whereas conventional dataset condensation has focused on training discriminative models. For this new challenge, we propose a novel Diffusion Dataset Condensation ($D^2C$) framework, which consists of two phases: \textit{Select} and \textit{Attach}. The \textit{Select} phase identifies a compact and diverse subset via a diffusion difficulty score and interval sampling; the \textit{Attach} phase then enriches the conditional signals and information of the selected subset by attaching rich semantic and visual representations. Extensive experiments across dataset sizes, model architectures, and resolutions demonstrate that our $D^2C$ can train diffusion models significantly faster with dramatically less data while retaining high visual quality. Notably, for the SiT-XL/2 architecture, our $D^2C$ achieves a $100\times$ acceleration, reaching an FID of 4.3 in just 40k steps using only 0.8\% of the training data.
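
As a rough illustration of the \textit{Select} phase described in the abstract, the sketch below scores each training sample and draws an evenly spaced (interval) subset across the sorted score range. The difficulty score used here is a hypothetical stand-in (random placeholder values in lieu of, e.g., a per-sample denoising loss); the paper's exact diffusion difficulty score is not reproduced in this abstract, so treat this as a sketch under stated assumptions rather than the authors' implementation.

```python
import numpy as np

def interval_select(scores: np.ndarray, k: int) -> np.ndarray:
    """Pick k sample indices spread evenly across the sorted score range.

    `scores` stands in for a per-sample "diffusion difficulty" score
    (assumption: higher = harder). Interval sampling keeps the subset
    diverse across the difficulty spectrum instead of concentrating it
    at one end, which is the intuition behind the Select phase.
    """
    order = np.argsort(scores)  # indices sorted easy -> hard
    # Evenly spaced positions over the sorted list (interval sampling).
    positions = np.linspace(0, len(scores) - 1, num=k).round().astype(int)
    return order[positions]

# Toy usage: condense a 50k-sample dataset to ~0.8% (400 samples).
rng = np.random.default_rng(0)
difficulty = rng.gamma(shape=2.0, scale=1.0, size=50_000)  # placeholder scores
subset_idx = interval_select(difficulty, k=400)
print(len(subset_idx), subset_idx[:5])
```

In the full $D^2C$ pipeline, the \textit{Attach} phase would then augment each selected sample with rich semantic and visual representations as conditioning; that step is model-specific and is not sketched here.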