Keywords: Video Generative AI, Acceleration
Abstract: The demand for high-resolution video generation is growing rapidly, but generation resolution is severely constrained by slow inference: for instance, Wan 2.1 requires over 50 minutes to generate a single 720p video. While previous works explore accelerating video generation from various angles, most of them compromise the distinctive priors (e.g., layout, semantics, motion) of the original model. In this work, we propose a new framework for efficient high-resolution video generation that preserves the pretrained prior. Specifically, we divide video generation into two stages: first, we leverage the pretrained model to quickly generate a low-resolution preview; then, we design a Refiner to upscale the preview. In the preview stage, we observe that directly running a model trained at high resolution on a lower resolution causes severe loss of the pretrained prior. To address this, we introduce noise reshifting, a training-free technique that performs the initial denoising steps at the original resolution and switches to the lower resolution in later steps. In the refinement stage, we establish a mapping between the preview and the high-resolution target, significantly reducing the number of denoising steps; we also integrate shifting windows and carefully design the training paradigm to create a powerful and efficient Refiner. In this way, our method enables efficient generation of high-resolution videos while remaining close to the prior of the given pretrained model. The method is conceptually simple and can serve as a plug-in compatible with various base models and acceleration methods; for example, it achieves a 12.5x speedup for generating 5-second, 16 fps, 720p videos.
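To make the noise-reshifting idea concrete, below is a minimal sketch of the preview-stage sampling loop, assuming a PyTorch-style video latent denoiser. The function and parameter names (`noise_reshift_sample`, `model`, `switch_step`) and the single-call denoising step are illustrative assumptions, not the paper's actual implementation; scheduler details are omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def noise_reshift_sample(model, timesteps, switch_step, hi_hw, lo_hw,
                         batch=1, channels=4, frames=16, device="cpu"):
    """Training-free noise reshifting (illustrative sketch).

    Runs the first `switch_step` denoising steps at the model's native
    (high) resolution to keep its pretrained prior intact, then
    downsamples the latent and finishes the remaining steps at the
    cheaper preview resolution. `model(latent, t)` is a hypothetical
    single-step denoiser returning the updated latent.
    """
    # Start from Gaussian noise at the native training resolution.
    latent = torch.randn(batch, channels, frames, *hi_hw, device=device)

    for i, t in enumerate(timesteps):
        if i == switch_step:
            # Switch point: spatially downsample the partially denoised
            # latent so the remaining steps run at preview resolution.
            latent = F.interpolate(latent, size=(frames, *lo_hw),
                                   mode="trilinear", align_corners=False)
        latent = model(latent, t)  # one denoising update at current size

    return latent  # low-resolution preview; the Refiner upscales it later
```

The key design choice this sketch reflects is that the expensive native-resolution steps are confined to the beginning of sampling, where the model's prior (layout, semantics, motion) is established, while the cheaper low-resolution steps handle the remainder.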
Supplementary Material: zip
Primary Area: generative models
Submission Number: 2025