Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: MoE with flexible routing strategy designed for diffusion models
Abstract: Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and selecting the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow-layer learning, and a router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains along with promising scaling properties.
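To make the routing idea concrete, below is a minimal sketch (not the authors' released code) of what a "tokens and experts compete together" router could look like: instead of each token independently keeping its top-k experts, all token-expert affinity scores are flattened and the global top candidates within a fixed budget are kept, so critical tokens can win more experts than others. The function name, shapes, budget definition, and softmax gating are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def expert_race_routing(scores: torch.Tensor, avg_k: int) -> torch.Tensor:
    """
    scores: [num_tokens, num_experts] router logits.
    avg_k:  average number of experts per token; the total activation budget
            is num_tokens * avg_k, shared across all tokens.
    Returns a dense gating matrix [num_tokens, num_experts] that is zero for
    token-expert pairs that lose the race.
    """
    num_tokens, num_experts = scores.shape
    budget = num_tokens * avg_k              # global budget, not per-token

    flat = scores.flatten()                  # all token-expert pairs race together
    _, topk_idx = flat.topk(budget)          # keep the global top candidates

    mask = torch.zeros_like(flat)
    mask[topk_idx] = 1.0
    mask = mask.view(num_tokens, num_experts)

    # Gate values for the winners: softmax over each token's selected experts.
    gates = F.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)
    gates = torch.nan_to_num(gates) * mask   # tokens with no winner get all-zero gates
    return gates

# Example: 8 tokens, 4 experts, an average of 2 active experts per token.
gates = expert_race_routing(torch.randn(8, 4), avg_k=2)
```

Because the top-k is taken over the flattened score matrix rather than per token, the number of experts assigned to each token varies, which is what allows the model to spend more capacity on harder tokens.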
Lay Summary: The diffusion process actually consists of numerous sub-tasks with varying levels of difficulty; evidently, denoising pure Gaussian noise is harder than denoising an almost fully clean image. Currently, these tasks are all handled by the same model. We aim to employ models of different sizes for tasks of different difficulty. The Mixture of Experts (MoE) technique is commonly used to scale up model capacity, and we find that it is also effective when applied to diffusion models. Furthermore, since the model incorporates a number of experts, we extend the routing strategy so that the model autonomously learns to activate different numbers of experts for different sub-tasks. The result is a model whose effective size adapts to the complexity of each task. Compared to previous MoE approaches, our method enables much more efficient model scaling with a simple modification to the routing strategy, and it demonstrates the potential of leveraging dynamics in diffusion models.
Primary Area: Deep Learning->Foundation Models
Keywords: Mixture of Experts, Diffusion Models
Submission Number: 1476