Abstract: Diffusion models have transformed generative modeling but suffer from scalability limitations due to computational overhead and inflexible architectures that process all generative stages and tokens uniformly. In this work, we introduce Diff-MoE, a novel framework that combines Diffusion Transformers with Mixture-of-Experts to exploit both temporal adaptability and spatial flexibility. Our design incorporates expert-specific timestep conditioning: each expert processes different spatial tokens while adapting to the generative stage, dynamically allocating resources according to both the temporal and spatial characteristics of the generative task. Additionally, we propose a globally-aware feature recalibration mechanism that amplifies the representational capacity of expert modules by dynamically adjusting feature contributions based on input relevance. Extensive experiments on image generation benchmarks demonstrate that Diff-MoE significantly outperforms state-of-the-art methods. Our work highlights the potential of integrating diffusion models with expert-based designs, offering a scalable and effective framework for advanced generative modeling.
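To make the layer design concrete, the PyTorch sketch below shows one way expert-specific timestep conditioning can be combined with token-level routing. This is our own minimal illustration under assumptions (top-k token routing, FiLM-style scale/shift modulation from a shared timestep embedding); all identifiers (`TimestepExpert`, `DiffMoELayer`, `n_experts`, `top_k`) are hypothetical and not taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimestepExpert(nn.Module):
    """One expert MLP whose hidden activations are modulated by an
    expert-specific projection of the timestep embedding (illustrative)."""
    def __init__(self, dim, hidden_dim, t_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        # Each expert owns its own timestep projection, so experts can
        # specialize to different generative stages (FiLM-style conditioning).
        self.t_proj = nn.Linear(t_dim, 2 * hidden_dim)

    def forward(self, x, t_emb):
        scale, shift = self.t_proj(t_emb).chunk(2, dim=-1)
        h = F.gelu(self.fc1(x))
        h = h * (1 + scale) + shift  # timestep-adaptive modulation
        return self.fc2(h)

class DiffMoELayer(nn.Module):
    """Routes each spatial token to its top-k experts; each selected
    expert applies its own timestep conditioning."""
    def __init__(self, dim, hidden_dim, t_dim, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [TimestepExpert(dim, hidden_dim, t_dim) for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, tokens, t_emb):
        # tokens: (n_tokens, dim); t_emb: (t_dim,) shared timestep embedding
        logits = self.router(tokens)                  # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                 # tokens routed to expert e
                if mask.any():
                    w = weights[mask, k].unsqueeze(-1)
                    out[mask] += w * expert(tokens[mask], t_emb)
        return out

# Example usage (shapes only, hypothetical sizes):
# layer = DiffMoELayer(dim=512, hidden_dim=2048, t_dim=256)
# out = layer(torch.randn(64, 512), torch.randn(256))
```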
Lay Summary: (1) Diffusion models excel at generating high-quality data but face scalability challenges due to heavy computation and one-size-fits-all processing across timesteps and spatial tokens.
(2) We introduce Diff‑MoE, which augments Diffusion Transformers with a Mixture‑of‑Experts architecture: each expert receives its own timestep conditioning and handles different spatial tokens, enabling dynamic allocation of compute based on where and when it’s most needed.
(3) We further enhance expert modules with a globally‑aware feature recalibration mechanism that amplifies relevant signals on the fly. Extensive image generation experiments show Diff‑MoE outperforms current state‑of‑the‑art methods, demonstrating a practical path to more scalable and adaptable generative modeling.
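The recalibration mechanism is described above only at a high level. One plausible realization, in the spirit of squeeze-and-excitation gating over a pooled global descriptor, is sketched below; `GlobalRecalibration` and its parameters are our illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GlobalRecalibration(nn.Module):
    """Illustrative globally-aware recalibration: pool all tokens into a
    global descriptor, then rescale each channel of the expert output by
    a learned, input-dependent gate (squeeze-and-excitation style)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, expert_out):
        # expert_out: (n_tokens, dim) features produced by an expert
        g = self.gate(expert_out.mean(dim=0))  # global context -> channel gates
        return expert_out * g                  # amplify or suppress per channel
```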
Link To Code: https://github.com/kunncheng/Diff-MoE
Primary Area: Applications->Computer Vision
Keywords: Diffusion Transformer, Mixture of Experts, Image Generation
Submission Number: 1197