COREDIT: SPATIAL COHERENCE-GUIDED TOKEN PRUNING AND RECONSTRUCTION FOR EFFICIENT DIFFUSION TRANSFORMERS

ICLR 2026 Conference Submission 21762 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: genai, diffusion, pruning, token, transformer, efficiency
Abstract: Diffusion Transformers (DiTs) have achieved remarkable results in image and video generation, but their high computational cost limits scalability and deployment. We introduce CoReDiT, a general-purpose token pruning framework for DiTs that applies across vision tasks. CoReDiT leverages spatial coherence to estimate token redundancy within local latent grids and selectively skips high-coherence tokens during self-attention. To preserve visual fidelity, we reconstruct the outputs of skipped tokens through similarity-weighted aggregation from spatially neighboring retained tokens that participated in the self-attention computation. In addition, we propose a progressive pruning schedule that dynamically adapts pruning ratios across transformer blocks and denoising steps based on redundancy statistics. Applied to state-of-the-art diffusion backbones such as PixArt-alpha and MagicDrive-V2, CoReDiT achieves up to a 55% reduction in self-attention FLOPs and latency speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Moreover, CoReDiT enables significantly higher-resolution generation on mobile devices. Our results demonstrate that spatial coherence is a powerful signal for structured pruning in diffusion transformers.
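The pruning-and-reconstruction idea described in the abstract can be sketched roughly as follows. This is an illustrative NumPy mock-up, not the authors' implementation: the coherence score (cosine similarity of each token to its local grid-cell mean), the `attn_fn` stand-in for a self-attention block, and the choice to aggregate skipped tokens from all retained tokens (rather than only spatial neighbors, as the paper describes) are all simplifying assumptions made here for brevity.

```python
import numpy as np

def coherence_scores(tokens, grid=2):
    """Per-token coherence proxy (assumption): cosine similarity of each
    token to the mean feature of its local grid cell. High coherence
    suggests the token is redundant and can be skipped."""
    H, W, D = tokens.shape
    scores = np.zeros((H, W))
    for i in range(0, H, grid):
        for j in range(0, W, grid):
            cell = tokens[i:i + grid, j:j + grid]
            h, w = cell.shape[:2]
            flat = cell.reshape(-1, D)
            mean = flat.mean(axis=0)
            sim = flat @ mean / (np.linalg.norm(flat, axis=1)
                                 * np.linalg.norm(mean) + 1e-8)
            scores[i:i + h, j:j + w] = sim.reshape(h, w)
    return scores

def prune_and_reconstruct(tokens, attn_fn, prune_ratio=0.5, grid=2):
    """Skip the highest-coherence tokens during attention, then reconstruct
    their outputs by similarity-weighted aggregation over retained tokens."""
    H, W, D = tokens.shape
    flat = tokens.reshape(-1, D)
    order = np.argsort(coherence_scores(tokens, grid).ravel())  # low first
    n_skip = int(prune_ratio * len(flat))
    keep_idx, skip_idx = order[:len(flat) - n_skip], order[len(flat) - n_skip:]

    kept_out = attn_fn(flat[keep_idx])  # self-attention over retained tokens only

    # Similarity-weighted aggregation: each skipped token borrows the
    # attention outputs of retained tokens, weighted by a softmax over
    # feature similarity.
    sims = flat[skip_idx] @ flat[keep_idx].T
    w = np.exp(sims - sims.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    out = np.empty_like(flat)
    out[keep_idx] = kept_out
    out[skip_idx] = w @ kept_out
    return out.reshape(H, W, D)
```

With `prune_ratio=0.5`, the self-attention call sees only half the tokens, which is where the quadratic-cost savings reported in the abstract would come from; the reconstruction step is a cheap matrix product.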
Primary Area: generative models
Submission Number: 21762