Keywords: diffusion models, mixture of experts, flow-based generation, image generation
TL;DR: DiffMoE is a dynamic MoE Transformer that outperforms 3× larger dense models on diffusion tasks, using a batch-level global token pool and adaptive routing while activating only 1× parameters.
Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation tasks, yet their uniform processing of inputs across varying conditions and noise levels fails to leverage the inherent heterogeneity of the diffusion process. While recent mixture-of-experts (MoE) approaches attempt to address this limitation, they struggle to achieve significant improvements due to their restricted token accessibility and fixed computational patterns. We present **DiffMoE**, a novel MoE-based architecture that enables experts to access global token distributions through a **batch-level global token pool** during training, promoting specialized expert behavior. To unleash the full potential of this inherent heterogeneity, DiffMoE incorporates a **capacity predictor** that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion transformers on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications.
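To make the two ideas named in the abstract concrete, here is a minimal, hedged sketch (not the authors' code, which is not shown here) of a batch-level global token pool with a capacity predictor, written in PyTorch. Names such as `BatchLevelMoE`, `capacity_predictor`, and the expert MLP shape are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: experts route over a pool of tokens flattened
# across the whole batch, and a learned capacity predictor sets how many
# pooled tokens each expert may process (a stand-in for the paper's
# dynamic allocation by noise level and sample complexity).
import torch
import torch.nn as nn


class BatchLevelMoE(nn.Module):
    """Expert-choice MoE layer whose router scores a batch-level global token pool."""

    def __init__(self, dim: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # Hypothetical capacity predictor: one scalar in (0, 1) per token,
        # averaged into a per-layer token budget.
        self.capacity_predictor = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        pool = x.reshape(b * n, d)                  # global token pool across the batch
        probs = self.router(pool).softmax(dim=-1)   # (b*n, num_experts)

        # Predicted capacity fraction -> number of pooled tokens per expert.
        cap = torch.sigmoid(self.capacity_predictor(pool)).mean()
        budget = max(1, int(cap.item() * pool.size(0)))

        out = torch.zeros_like(pool)
        for e, expert in enumerate(self.experts):
            scores = probs[:, e]
            top_scores, idx = scores.topk(min(budget, pool.size(0)))
            # Each expert selects its own tokens from the entire pool,
            # weighted by its routing score.
            out[idx] += top_scores.unsqueeze(-1) * expert(pool[idx])
        return out.reshape(b, n, d)


# Usage example with dummy latent tokens.
if __name__ == "__main__":
    layer = BatchLevelMoE(dim=64, num_experts=4)
    tokens = torch.randn(8, 256, 64)  # (batch, tokens per image, channels)
    print(layer(tokens).shape)        # torch.Size([8, 256, 64])
```

In this sketch the pool mixes tokens from all samples (and hence all noise levels) in the batch, which is what lets experts specialize globally rather than per sample; the exact routing rule and how capacity varies with noise level in DiffMoE should be taken from the paper itself.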
Supplementary Material: zip
Primary Area: generative models
Submission Number: 18370