MoB: Mixture of Block Transformer for Accelerating Video Generation with Dynamic Routing

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Diffusion Transformer, Video Generation, Efficiency Improvement
Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional performance in high-fidelity image and video generation tasks. However, their iterative denoising process introduces substantial computational redundancy within Transformer modules, resulting in prohibitively high computational costs and slow inference. Through comprehensive experimental analysis of existing DiTs, we reveal two key observations: (1) outputs of different Transformer blocks exhibit significant similarity during the denoising process, and (2) block-level redundancy varies dynamically across denoising timesteps. Based on these insights, we propose \textbf{Mixture of Blocks (MoB)}, the first framework to introduce block-level dynamic routing for DiT acceleration. The core of MoB is a lightweight routing network that dynamically evaluates the importance of each Transformer block based on the input prompt. We further propose an Ada-Top-\(k\) mechanism that, at each denoising step, selects relevant blocks using the \(k\)-th largest score as an adaptive threshold, avoiding the winner-take-all problem of traditional soft selection while eliminating 10-20\% of redundant computation. To compensate for information loss from skipped blocks, we design a Block Cache mechanism that maintains generation quality by reusing intermediate feature differences from previous timesteps. Furthermore, MoB integrates adaptive timestep skipping and employs knowledge distillation to train the routing network, improving both inference efficiency and training stability. We also evaluate MoB's generalization to image generation using Flux.1. Extensive experiments demonstrate that MoB achieves significant inference acceleration while preserving generation fidelity in both video and image generation tasks, consistently outperforming existing baselines in both efficiency and quality.
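For intuition only, below is a minimal PyTorch-style sketch of block-level dynamic routing with Ada-Top-\(k\) selection and a Block Cache, as described in the abstract. It is not the paper's implementation: the names `MoBRouter`, `mob_forward`, `keep_ratio`, and `block_cache`, and the assumption that each DiT block takes `(x, cond)`, are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of MoB-style block routing,
# assuming a generic PyTorch DiT whose Transformer blocks accept (x, cond).
import torch
import torch.nn as nn

class MoBRouter(nn.Module):
    """Lightweight router that scores every Transformer block for one timestep."""
    def __init__(self, cond_dim: int, num_blocks: int):
        super().__init__()
        self.scorer = nn.Linear(cond_dim, num_blocks)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: pooled prompt/timestep embedding, shape (B, cond_dim)
        return self.scorer(cond).mean(dim=0)  # one score per block, shape (num_blocks,)

def mob_forward(blocks, router, x, cond, block_cache, keep_ratio=0.85):
    """One DiT forward pass that executes only the top-scored blocks.

    Skipped blocks are approximated by re-applying the feature difference
    (output minus input) cached the last time they were executed.
    """
    scores = router(cond)
    k = max(1, int(keep_ratio * len(blocks)))
    # Ada-Top-k: the k-th largest score acts as an adaptive threshold.
    threshold = torch.topk(scores, k).values[-1]

    for i, block in enumerate(blocks):
        if scores[i] >= threshold or i not in block_cache:
            out = block(x, cond)                  # execute the block
            block_cache[i] = (out - x).detach()   # Block Cache: store the feature delta
            x = out
        else:
            x = x + block_cache[i]                # reuse the cached delta from an earlier timestep
    return x
```

In this sketch the \(k\)-th largest routing score serves as the adaptive threshold, and skipped blocks re-apply their cached feature differences, mirroring the Ada-Top-\(k\) and Block Cache mechanisms summarized above.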
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11243