MD-DiT: Step-aware Mixture-of-Depths for Efficient Diffusion Transformers

Published: 10 Oct 2024, Last Modified: 19 Nov 2024
Venue: AFM 2024 Poster
License: CC BY 4.0
Keywords: Training-free acceleration, diffusion transformers, Mixture-of-Depths.
Abstract: Diffusion models (DMs) excel in vision generation tasks such as Text-to-Image but face high computational demands due to the many denoising timesteps they require. While reducing the number of timesteps has been the primary focus of previous studies, our research aims to optimize DM inference efficiency by reconfiguring the model architecture, particularly for diffusion transformers (DiT). Drawing inspiration from mixture-of-depths (MD) models, we exploit the computational asymmetry across timesteps: each computational block contributes differently at each timestep. This observation leads us to explore strategies that bypass certain computational blocks (block skipping) or reuse their results from previous timesteps (block caching). To this end, we introduce MD-DiT, a unified framework that optimizes diffusion transformers by integrating block skipping and caching through a gradient-free search, allowing the model to select which blocks to execute at each timestep for improved inference efficiency. Our findings demonstrate a 20% reduction in computational cost for a 4-step Latent Consistency Model (LCM) and a 59% reduction in a 40-step setup. MD-DiT exceeds the performance of state-of-the-art training-free methods such as DeepCache, TGATE, and T-Stitch.
Submission Number: 86
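
The abstract describes per-timestep block skipping and caching but gives no implementation details. Below is a minimal, hypothetical PyTorch sketch of that step-aware skip/cache idea: a table maps each (timestep, block) pair to an action, and "caching" is modeled as reusing a block's residual update from an earlier timestep. The names (ToyBlock, SteppedDiT, policy) and the residual-reuse form of caching are our illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a DiT transformer block (hypothetical, for illustration)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.mlp(self.norm(x))  # pre-norm residual block

class SteppedDiT(nn.Module):
    """Toy denoiser whose per-(timestep, block) action is read from a policy.
    Actions: 'run' executes the block, 'skip' bypasses it (identity),
    'cache' reuses the block's residual update from a previous timestep."""
    def __init__(self, dim, n_blocks, policy):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(n_blocks))
        self.policy = policy   # policy[t][i] in {'run', 'skip', 'cache'}
        self._cache = {}       # block index -> last residual update

    def forward(self, x, t):
        for i, block in enumerate(self.blocks):
            action = self.policy[t][i]
            if action == "run":
                out = block(x)
                self._cache[i] = out - x  # store residual update for later reuse
                x = out
            elif action == "cache" and i in self._cache:
                x = x + self._cache[i]    # reuse the cached update
            # 'skip' (or a cache miss): identity, block is bypassed
        return x
```

A usage sketch under the same assumptions, with a hand-written policy standing in for one found by the paper's gradient-free search:

```python
# Hypothetical 4-step policy over 3 blocks: early (noisier) steps run everything,
# later steps skip or reuse cached updates to save compute.
policy = {
    3: ["run", "run", "run"],
    2: ["run", "cache", "run"],
    1: ["cache", "skip", "run"],
    0: ["cache", "skip", "run"],
}
model = SteppedDiT(dim=64, n_blocks=3, policy=policy)
x = torch.randn(2, 16, 64)  # (batch, tokens, dim) latent tokens
for t in (3, 2, 1, 0):      # reverse-diffusion order
    x = model(x, t)
```

In the paper's framework the policy itself would be the object of the gradient-free search, scored by a trade-off between compute saved and generation quality; the dictionary above is only a placeholder for such a search result.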