Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0
Abstract: Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an *independence-aware channel pruning* method to mitigate severe channel redundancy, and (2) a *stage-wise dominant operator optimization* strategy to address the high inference cost of the causal 3D convolutions widely used in VAE decoders. Based on these innovations, we construct the **Flash-VAED** family. Moreover, we design a *three-phase dynamic distillation* framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a **6$\times$ speedup** while retaining up to **96.9%** of the original reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to **36%** with negligible quality drops on VBench-2.0. Our code is available at https://github.com/Aoko955/Flash-VAED.
Lay Summary: AI-based video generation has recently made remarkable progress, producing increasingly realistic and coherent videos. Much of this success can be attributed to powerful latent diffusion models, which typically consist of a diffusion transformer (DiT) that generates content through iterative denoising in a compact latent space, and a VAE decoder that converts the generated latent representation back into visible frames. Despite their strong performance, video generation models remain computationally expensive and slow during inference, which hinders practical deployment. Prior acceleration research has mainly focused on the DiT module, either by reducing denoising steps or by applying model compression. As DiT acceleration advances, however, the latency bottleneck has gradually shifted toward the largely overlooked VAE decoder. In this work, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent space, and based on it, we construct the Flash-VAED family. We identify two key sources of decoding latency: severe channel redundancy and the high cost of causal 3D convolutions. To address them, we introduce independence-aware channel pruning, which retains a small set of informative channels from which the full set of channel feature maps can be reconstructed, and stage-wise dominant operator optimization, which replaces costly causal 3D convolutions with efficient operators tailored to different decoder stages. We further develop a three-phase dynamic distillation framework to transfer the original decoder's capability to Flash-VAED. Experiments on Wan and LTX-Video VAE decoders show that Flash-VAED achieves about $6\times$ decoding speedup while retaining up to $96.9\%$ of the original reconstruction performance, and accelerates full video generation pipelines by up to $36\%$ with negligible quality drops on VBench-2.0.
Link To Code: https://github.com/Aoko955/Flash-VAED
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Video VAE Decoders, Channel Pruning, Causal 3D Convolutions, Feature Distillation
Submission Number: 6571
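
The relationship between the reported decoder speedup (about 6$\times$) and the end-to-end gain (up to 36%) can be illustrated with Amdahl's law. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and it assumes that only the VAE decoder is accelerated while the rest of the pipeline is unchanged. Because the abstract does not state whether "36%" is a throughput gain (1.36$\times$) or a latency reduction (0.64 of the original time), both readings are shown as assumptions.

```python
def pipeline_speedup(decoder_fraction: float, decoder_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the decoder is accelerated.

    decoder_fraction: share of the original end-to-end time spent in the decoder.
    decoder_speedup:  factor by which the decoder alone becomes faster.
    """
    if not 0.0 <= decoder_fraction <= 1.0:
        raise ValueError("decoder_fraction must lie in [0, 1]")
    if decoder_speedup <= 0.0:
        raise ValueError("decoder_speedup must be positive")
    # New time, relative to the original, is (1 - f) + f / s.
    relative_time = (1.0 - decoder_fraction) + decoder_fraction / decoder_speedup
    return 1.0 / relative_time


def decoder_fraction_for(target_speedup: float, decoder_speedup: float) -> float:
    """Invert Amdahl's law: decoder time share implied by a given overall speedup.

    From 1/S = 1 - f * (1 - 1/s), it follows that f = (1 - 1/S) / (1 - 1/s).
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / decoder_speedup)


if __name__ == "__main__":
    s = 6.0  # approximate decoder speedup quoted in the abstract

    # Assumption A: "36%" means the pipeline runs 1.36x faster.
    f_a = decoder_fraction_for(1.36, s)
    # Assumption B: "36%" means end-to-end latency drops to 64% of the original.
    f_b = decoder_fraction_for(1.0 / 0.64, s)

    print(f"implied decoder share (throughput reading): {f_a:.3f}")
    print(f"implied decoder share (latency reading):    {f_b:.3f}")
```

Under either reading, the decoder would have to account for a substantial share of the original pipeline time, which is consistent with the abstract's claim that the bottleneck shifts to the VAE decoder once the diffusion transformer is made efficient.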