Keywords: Video Generation, Diffusion-based Models, Model Distillation
TL;DR: We propose a new asymmetric structural distillation method that produces videos of superior quality.
Abstract: Due to bidirectional attention dependencies, video generation models generally suffer from $O(n^2)$ computational complexity. In this work, we identify a "local inter-frame information redundancy" phenomenon: video generation exhibits strong local temporal dependencies, while global attention to distant frames contributes only marginally. Building upon this finding, we introduce a novel distillation training paradigm for video diffusion models, namely GREEDY DISTILL.
Specifically, we propose the Streaming Diffusion Decoder (SDD) as the "Greedy Decoder", which generates the next frame conditioned only on the 0-th and the last frames, avoiding the redundant computation incurred by the remaining frames.
Meanwhile, we introduce the Efficient Temporal Module (ETM) to capture global temporal information across frames.
Together, these two modules reduce the computational complexity from $O(n^2)$ to $O(n)$. Moreover, we make a first attempt at applying RL fine-tuning to mitigate error accumulation during streaming generation.
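To illustrate where the linear complexity comes from, below is a minimal sketch (our own illustration, not the paper's code) of a per-frame attention pattern in which each new frame attends only to the 0-th frame and the immediately preceding frame; all function names, tensor shapes, and the attention formulation are assumptions made for exposition.

```python
# Minimal sketch: each frame attends only to frame 0 and the previous frame,
# so total attention cost grows linearly in the number of frames.
import torch
import torch.nn.functional as F

def greedy_frame_attention(frames: torch.Tensor) -> torch.Tensor:
    """frames: (n_frames, tokens_per_frame, dim); names/shapes are illustrative."""
    n, t, d = frames.shape
    outputs = [frames[0]]
    for i in range(1, n):
        q = frames[i]                               # queries from the current frame
        kv = torch.cat([frames[0], frames[i - 1]])  # keys/values: frame 0 + previous frame only
        attn = F.softmax(q @ kv.T / d**0.5, dim=-1)
        outputs.append(attn @ kv)                   # cost per frame is constant -> O(n) overall
    return torch.stack(outputs)

x = torch.randn(16, 64, 32)             # 16 frames, 64 tokens each, dim 32
print(greedy_frame_attention(x).shape)  # torch.Size([16, 64, 32])
```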
Our method achieves an overall score of 84.60 on the VBench benchmark, surpassing previous state-of-the-art methods by a large margin (+4.18%). Qualitative results also demonstrate its superior performance.
Leveraging its efficient model structure and KV caching, our method rapidly generates high-quality video streams at 24 FPS (nearly 50% faster) on a single H100 GPU.
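As a rough sketch of how KV caching keeps streaming cost constant, the toy loop below maintains only two cache entries (frame 0 and the latest frame), so per-step memory and compute do not grow with video length; `ToyDecoder`, `encode_kv`, and `step` are hypothetical stand-ins, not the released implementation.

```python
# Minimal sketch of constant-memory streaming with a two-entry KV cache.
import torch

class ToyDecoder:
    def encode_kv(self, frame):          # placeholder "KV" state for a frame
        return frame
    def step(self, frame0_kv, prev_kv):  # produce the next frame and its cache entry
        nxt = 0.5 * (frame0_kv + prev_kv) + 0.01 * torch.randn_like(prev_kv)
        return nxt, nxt

def stream_video(decoder, first_frame, n_frames):
    frame0_kv = prev_kv = decoder.encode_kv(first_frame)
    for _ in range(n_frames - 1):
        frame, prev_kv = decoder.step(frame0_kv, prev_kv)  # cache never grows
        yield frame

frames = list(stream_video(ToyDecoder(), torch.randn(3, 64, 64), 24))
print(len(frames), frames[0].shape)  # 23 torch.Size([3, 64, 64])
```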
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5458