AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A training-free token-reduction approach to accelerate state-of-the-art video DiTs without compromising generation quality.
Abstract: Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. Existing video diffusion sampling acceleration methods often rely on costly fine-tuning or exhibit limited generalization. We propose Asymmetric Reduction and Restoration (**AsymRnR**), **a training-free and model-agnostic method to accelerate video DiTs**. It builds on the observation that the redundancy of feature tokens in DiTs varies significantly across model blocks, denoising steps, and feature types. AsymRnR asymmetrically reduces redundant tokens in the attention operation, achieving acceleration with negligible degradation in output quality and, in some cases, even improving it. We also tailor a reduction schedule that distributes the reduction adaptively across components. To further cut overhead, we introduce a matching cache that makes the reduction itself more efficient. Backed by theoretical foundations and extensive experimental validation, AsymRnR integrates into state-of-the-art video DiTs and offers substantial speedups.
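The abstract does not spell out the matching procedure, so the following is a minimal sketch of the reduce-then-restore idea around a single attention call, assuming a ToMe-style bipartite cosine-similarity matching. The function name `rnr_attention`, the even/odd token split, and the choice to reduce only query tokens are illustrative assumptions, not the paper's exact algorithm; the sketch also omits the asymmetric per-block/per-step schedule and the matching cache.

```python
import torch
import torch.nn.functional as F

def rnr_attention(q, k, v, r):
    """Drop the r most redundant query tokens before attention, then
    restore their outputs by copying from matched kept tokens.
    q, k, v: (batch, tokens, channels); assumes an even token count.
    Hypothetical sketch, not the paper's exact procedure."""
    B, N, C = q.shape
    # Bipartite matching (ToMe-style): sources = even tokens, destinations
    # = odd tokens, so every source's best match is guaranteed to be kept.
    src, dst = q[:, ::2], q[:, 1::2]
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).transpose(-2, -1)
    score, match = sim.max(dim=-1)             # best destination per source
    drop = score.topk(r, dim=-1).indices       # r most redundant sources
    src_ids = torch.arange(0, N, 2, device=q.device).expand(B, -1)
    dst_ids = torch.arange(1, N, 2, device=q.device).expand(B, -1)
    drop_ids = torch.gather(src_ids, 1, drop)  # token ids to remove
    keep = torch.ones(B, N, dtype=torch.bool, device=q.device)
    keep.scatter_(1, drop_ids, False)
    keep_ids = keep.nonzero()[:, 1].view(B, N - r)
    # Reduction: attend with the kept queries only. K and V stay full in
    # this sketch; the paper reduces Q/K/V asymmetrically per block and step.
    q_kept = torch.gather(q, 1, keep_ids.unsqueeze(-1).expand(-1, -1, C))
    out_kept = F.scaled_dot_product_attention(q_kept, k, v)
    # Restoration: scatter kept outputs back, then fill each dropped token
    # with the output of its matched (kept) destination token.
    out = torch.zeros_like(q)
    out.scatter_(1, keep_ids.unsqueeze(-1).expand(-1, -1, C), out_kept)
    match_ids = torch.gather(dst_ids, 1, torch.gather(match, 1, drop))
    fill = torch.gather(out, 1, match_ids.unsqueeze(-1).expand(-1, -1, C))
    out.scatter_(1, drop_ids.unsqueeze(-1).expand(-1, -1, C), fill)
    return out
```

Because the output keeps the full token count, a call like `rnr_attention(q, k, v, r=64)` could replace the attention op inside a DiT block without modifying the surrounding network; the paper's reduction schedule additionally varies how much is reduced per block, denoising step, and feature type.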
Lay Summary: Diffusion Transformers (DiTs) demand heavy computing power and energy to generate videos. Most of the computational burden arises from the model processing a large number of tokens, each representing a small patch of a video frame, that interact with each other at every step. Our approach, Asymmetric Reduction and Restoration (AsymRnR), spots when much of this computation is redundant and safely skips it in a step-dependent way. When the skipped information is needed again, AsymRnR quickly restores it so the final video stays sharp and coherent. The method requires no retraining or fine-tuning and works with any modern video DiT right out of the box. In experiments, it reduces running time by nearly one third while preserving visual quality. By making state-of-the-art video generation cheaper, faster and greener, AsymRnR lowers the barrier for creative storytelling, simulation and scientific visualisation.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/wenhao728/AsymRnR
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Video Generation, Diffusion Models, Video Transformers, Efficient Diffusion
Submission Number: 230