Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III Spotlight · CC BY 4.0
Keywords: hardware-efficiency, dynamic sparsity, activation caching, diffusion transformer
TL;DR: A hardware-efficient method that speeds up Diffusion Transformer generation by computing dynamically sparse deltas against cached activations in attention and MLP layers.
Abstract: Diffusion Transformers (DiTs) excel in generating high-quality images and videos but suffer from redundant computations at inference, increasing costs. Observing that only a small fraction (5-25%) of activations in attention and MLP layers account for 70-90% of the change across inference steps, we introduce Chipmunk, a dynamic sparsity method that recomputes only these rapidly changing activations while caching the remainder. Dynamic sparsity, however, poses system-level challenges, specifically GPU tensor core underutilization and additional runtime overhead from computing sparsity patterns and managing cached activations. To maximize GPU efficiency and approximation quality, Chipmunk employs voxel-based token reordering and efficient column-sparse kernels, achieving a 9.3x kernel speedup at 93% sparsity. Chipmunk also overlaps sparsity pattern computation and cache updates with ongoing computation to mask overhead latency. Chipmunk achieves up to 2.16x speedup on HunyuanVideo and 1.41x on FLUX.1-dev. Furthermore, we show that Chipmunk can be stacked on top of full step caching, achieving a 3.72x speedup on HunyuanVideo, a 2.67x speedup on WAN2.1, and a 2.56x speedup on FLUX.1-dev with minimal quality impact.
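To illustrate the cached-delta idea the abstract describes, here is a minimal PyTorch sketch of a column-sparse delta update for an MLP layer. The function name, the cache layout, and the refresh policy are assumptions made for illustration; the paper's actual column-sparse GPU kernels, voxel-based token reordering, and overlapped cache updates are not shown.

```python
import torch

def chipmunk_style_mlp(x, w1, w2, cache, keep_frac=0.1, refresh=False):
    """Hypothetical sketch: cache hidden activations and the layer output,
    then on most steps recompute only the fastest-changing hidden columns
    and apply their delta to the cached output."""
    if refresh or "hidden" not in cache:
        hidden = torch.relu(x @ w1)                          # dense compute on refresh steps
        prev = cache.get("hidden", torch.zeros_like(hidden))
        k = max(1, int(keep_frac * hidden.shape[-1]))
        # select the k hidden columns that changed most since the last cached step
        cache["cols"] = (hidden - prev).abs().mean(dim=0).topk(k).indices
        cache["hidden"], cache["out"] = hidden, hidden @ w2
        return cache["out"]

    cols = cache["cols"]
    new_cols = torch.relu(x @ w1[:, cols])                   # recompute only selected columns
    delta = new_cols - cache["hidden"][:, cols]
    cache["out"] = cache["out"] + delta @ w2[cols, :]        # column-sparse delta update
    cache["hidden"][:, cols] = new_cols
    return cache["out"]
```

In this toy version the sparsity pattern is chosen only on dense refresh steps and reused in between, which mirrors the paper's observation that a small set of activations accounts for most of the step-to-step change; the real system hides the pattern computation and cache maintenance behind ongoing GPU work.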
Submission Number: 86