Keywords: Diffusion Models, Vision Transformers, Image Generation, Explainable AI, Training Efficiency
TL;DR: We show that Diffusion Transformers (DiTs) act as semantic autoencoders, and propose the Multi-Scale Diffusion Transformer (MDiT), which achieves 3x faster convergence and a 7x overall training speedup on ImageNet while using fewer training images and FLOPs.
Abstract: Diffusion models have significantly advanced image synthesis but often suffer from high computational demands and slow convergence during training. To tackle these challenges, we propose the Multi-Scale Diffusion Transformer (MDiT), which incorporates heterogeneous, asymmetric, scale-specific transformer blocks to reintroduce explicit structural inductive biases into diffusion transformers (DiTs). Using explainable AI techniques, we demonstrate that DiTs inherently learn these biases, exhibiting distinct encode-decode behavior and effectively functioning as semantic autoencoders. Our optimized MDiT architecture leverages this understanding to achieve a $\ge 3\times$ increase in convergence speed on FFHQ $256\times256$ and ImageNet $256\times256$, culminating in a $7\times$ training speedup on ImageNet compared with state-of-the-art models. This acceleration significantly reduces the computational cost of training, measured in FLOPs, enabling more efficient resource use and improving performance on smaller datasets. Additionally, we develop a variance matching regularization technique that corrects the sample variance discrepancies which can arise in latent diffusion models, enhancing image contrast and vibrancy and further accelerating convergence.
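The abstract describes the variance matching regularization only at a high level. As a concrete illustration, here is a minimal PyTorch sketch of what such a regularizer could look like; the function name `variance_matching_loss`, the weight `lambda_vm`, and the choice of per-sample variance over latent dimensions are our own assumptions for this sketch, not the paper's actual formulation.

```python
import torch

def variance_matching_loss(x0_pred: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
    """Hypothetical variance matching regularizer (illustrative only).

    Penalizes mismatch between the per-sample variance of the model's
    predicted clean latent and that of the ground-truth latent, so that
    generated samples do not drift toward reduced contrast and vibrancy.

    x0_pred, x0: tensors of shape (B, C, H, W) in the VAE latent space.
    """
    # Variance over all non-batch dimensions, computed independently per sample.
    var_pred = x0_pred.flatten(1).var(dim=1, unbiased=False)
    var_ref = x0.flatten(1).var(dim=1, unbiased=False)
    return (var_pred - var_ref).pow(2).mean()

# Assumed usage alongside a standard denoising objective:
# loss = mse(eps_pred, eps) + lambda_vm * variance_matching_loss(x0_pred, x0)
```

In this sketch the regularizer adds a scalar penalty to the training loss, which is one plausible way a variance correction could also aid convergence; the paper's exact loss and weighting may differ.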
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8219