Keywords: Diffusion models, Convolutional neural networks
Abstract: Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness—the attributes that established ConvNets as the default vision backbone—have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a ConvNeXt-inspired backbone redesigned for conditional diffusion modeling. Specifically, FCDM employs an easily scalable U-Net hierarchy that integrates global context with fine-grained details and preserves strict convolutional locality, maximizing throughput on modern accelerators. We find that FCDM-XL, using only half the FLOPs of DiT-XL/2, achieves a better FID than DiT-XL/2 with 7$\times$ and 7.5$\times$ speedups at 256$\times$256 and 512$\times$512 resolutions, respectively. Our results demonstrate that modern convolutional designs remain highly competitive when scaled and properly conditioned, challenging the prevailing view that “bigger Transformers” are the sole path to better diffusion models. FCDM revives ConvNets as a compelling, computationally efficient alternative for large-scale generative vision.
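The abstract only sketches the architecture at a high level. As a rough illustration of what one ConvNeXt-style block with diffusion-timestep conditioning might look like, here is a minimal NumPy sketch. Everything here is an assumption for illustration, not the paper's actual design: the name `fcdm_block`, the AdaLN-style scale/shift conditioning (borrowed from DiT-like models), the 7$\times$7 depthwise kernel, and the 4$\times$ pointwise expansion are all standard choices, not details given in the abstract.

```python
import numpy as np

def timestep_embedding(t, dim):
    # Standard sinusoidal timestep embedding (assumption: FCDM uses something similar).
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layernorm(x, eps=1e-6):
    # Normalize over the channel axis, per spatial position; x: (C, H, W).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv(x, w):
    # Naive depthwise conv, 'same' padding, stride 1; x: (C, H, W), w: (C, k, k).
    C, H, W = x.shape
    k = w.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * w[c])
    return out

def fcdm_block(x, t, params):
    # Hypothetical FCDM block: depthwise 7x7 conv -> LayerNorm -> AdaLN-style
    # timestep scale/shift -> pointwise MLP (expand 4x, GELU, project) -> residual.
    C, H, W = x.shape
    emb = timestep_embedding(t, params["W_t"].shape[1])
    scale_shift = params["W_t"] @ emb          # (2C,)
    scale, shift = scale_shift[:C], scale_shift[C:]
    h = depthwise_conv(x, params["dw"])
    h = layernorm(h)
    h = h * (1.0 + scale)[:, None, None] + shift[:, None, None]
    h_flat = h.reshape(C, H * W)
    h2 = gelu(params["W1"] @ h_flat)           # pointwise expand: (4C, H*W)
    h3 = (params["W2"] @ h2).reshape(C, H, W)  # pointwise project back to C channels
    return x + h3                              # residual connection

# Tiny usage example. Zero-initializing the final projection (a common trick)
# makes the block an identity at initialization.
rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
params = {
    "dw": rng.normal(size=(C, 7, 7)) * 0.1,
    "W_t": rng.normal(size=(2 * C, 16)) * 0.1,
    "W1": rng.normal(size=(4 * C, C)) * 0.1,
    "W2": np.zeros((C, 4 * C)),
}
x = rng.normal(size=(C, H, W))
y = fcdm_block(x, t=10.0, params=params)
```

The residual form keeps the spatial resolution fixed; in a full U-Net these blocks would be interleaved with down/upsampling stages, which is how a purely convolutional stack can integrate global context without attention.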
Primary Area: generative models
Submission Number: 11009