Diffusion Models Without Attention

Published: 09 Jun 2024 · Last Modified: 29 Sep 2024 · CVPR 2024 · CC BY 4.0
Abstract: Denoising diffusion probabilistic models (DDPMs) have spearheaded advances in high-fidelity image generation. However, scaling current DDPM architectures to high resolutions remains computationally challenging, due to the use of attention in either UNet architectures or Transformer variants. To keep models tractable, it is common to apply lossy compression in the hidden space, such as patchifying, which trades representational capacity for efficiency. We propose the Diffusion State Space Model (DiffuSSM), an architecture that replaces attention with a more efficient state space model backbone. The model avoids global compression, enabling a longer, more fine-grained image representation throughout the diffusion process. Comprehensive evaluations on ImageNet and LSUN show superior FID and Inception Score at lower total FLOP usage compared to previous diffusion models using attention.
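To make the core idea concrete, below is a minimal, hypothetical sketch of an attention-free diffusion block in PyTorch: the sequence-mixing step that self-attention would normally perform is replaced by a simple diagonal state space recurrence, whose cost is linear in sequence length. All names (`SimpleSSM`, `DiffuSSMBlock`) and design details here are illustrative assumptions, not the paper's actual architecture, which uses a more elaborate gated, bidirectional SSM with an hourglass feed-forward design.

```python
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Minimal diagonal state space layer: h_t = a * h_{t-1} + b * x_t, y_t = <c, h_t>.
    Illustrative stand-in for the paper's SSM backbone (assumption, not the authors' code)."""
    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.a_raw = nn.Parameter(torch.randn(dim, state_dim))              # per-channel decay
        self.b = nn.Parameter(torch.randn(dim, state_dim) / state_dim**0.5)  # input projection
        self.c = nn.Parameter(torch.randn(dim, state_dim) / state_dim**0.5)  # output projection

    def forward(self, x):                       # x: (batch, length, dim)
        a = torch.sigmoid(self.a_raw)           # constrain decay to (0, 1) for stability
        h = x.new_zeros(x.size(0), x.size(2), self.b.size(1))
        ys = []
        for t in range(x.size(1)):              # sequential scan over flattened image tokens
            h = a * h + self.b * x[:, t].unsqueeze(-1)
            ys.append((h * self.c).sum(-1))
        return torch.stack(ys, dim=1)           # (batch, length, dim)

class DiffuSSMBlock(nn.Module):
    """Hypothetical attention-free diffusion block: token mixing via an SSM scan instead of
    self-attention, so no patchify-style compression of the hidden sequence is required."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ssm = SimpleSSM(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.ssm(self.norm1(x))         # linear-cost sequence mixing
        return x + self.mlp(self.norm2(x))      # position-wise channel mixing
```

Because the scan touches each token once, processing a full-resolution latent (e.g. a 64x64 grid flattened to 4096 tokens) scales linearly rather than quadratically, which is what lets the model skip the global compression step described in the abstract.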