Keywords: Diffusion models, efficiency, memorization
TL;DR: We propose a novel diffusion model framework that incorporates an explicit memory mechanism into diffusion modeling, accelerating training by over 50 times on ImageNet 256x256.
Abstract: Conditional diffusion models require external guidance for generation, but common signals like text prompts are often noisy, necessitating prolonged training on massive, high-quality paired datasets.
To address this, we introduce Generative Modeling with Explicit Memory (GMem), a framework that instead conditions generation on high-quality semantic information extracted directly from the data themselves.
These conditioning signals are stored in an external memory bank, providing accurate guidance that substantially accelerates training.
Our experiments on ImageNet $256\times 256$ show that GMem achieves a $50\times$ training speedup over SiT while also reaching a state-of-the-art (SoTA) FID of $1.53$.
The key contributions of our work are threefold:
(i) We demonstrate substantial training acceleration (over $50\times$) on ImageNet.
(ii) We propose an efficient downstream adaptation pathway, where the image-pretrained model serves as a base model for adapting to new tasks.
(iii) We introduce a data- and compute-efficient text-to-image (T2I) pipeline that matches the quality of strong baselines like PixArt-$\alpha$ using only $\frac{1}{17}$ of the data and $\frac{1}{9}$ of the training time.
Our work establishes conditioning with explicit memory as a powerful paradigm for efficient and effective generative modeling.
Our code will be made publicly available.
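To make the idea of conditioning on an explicit memory bank concrete, below is a minimal sketch (not the authors' released implementation) of how a denoiser might be conditioned on per-sample semantic embeddings stored in an external bank. All names (`MemoryBank`, `Denoiser`, `embed_dim`, the flow-matching style training target) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming per-sample semantic embeddings are precomputed by a
# frozen encoder and stored in an external memory bank that conditions the denoiser.
import torch
import torch.nn as nn


class MemoryBank(nn.Module):
    """Stores one fixed semantic embedding per training sample (hypothetical design)."""

    def __init__(self, num_samples: int, embed_dim: int):
        super().__init__()
        # Filled once from a pretrained encoder, then kept frozen (assumption).
        self.register_buffer("bank", torch.zeros(num_samples, embed_dim))

    def write(self, indices: torch.Tensor, embeddings: torch.Tensor) -> None:
        self.bank[indices] = embeddings

    def read(self, indices: torch.Tensor) -> torch.Tensor:
        return self.bank[indices]


class Denoiser(nn.Module):
    """Toy denoiser that takes the retrieved memory embedding as its conditioning signal."""

    def __init__(self, data_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + embed_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, data_dim),
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy sample, memory embedding, and timestep.
        return self.net(torch.cat([x_t, cond, t[:, None]], dim=-1))


# One illustrative training step with a flow-matching style target (assumption).
bank = MemoryBank(num_samples=1000, embed_dim=64)
model = Denoiser(data_dim=32, embed_dim=64)

idx = torch.randint(0, 1000, (8,))
x0 = torch.randn(8, 32)                      # clean latents (placeholder data)
cond = bank.read(idx)                        # guidance retrieved from explicit memory
t = torch.rand(8)
noise = torch.randn_like(x0)
x_t = (1 - t[:, None]) * x0 + t[:, None] * noise
loss = ((model(x_t, t, cond) - (noise - x0)) ** 2).mean()
loss.backward()
```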
Supplementary Material: zip
Primary Area: generative models
Submission Number: 12784