Keywords: Long Video Generation, Diffusion Models, Transformers
TL;DR: 60-second long video generation by autoregressive video denoising with progressive noise levels.
Abstract: Current frontier video diffusion models have demonstrated remarkable results in generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computational limitations during training. In this work, we show that existing models can be naturally extended to autoregressive video diffusion models without changing their architectures. Our key idea is to assign the latent frames progressively increasing noise levels rather than a single noise level, which allows for fine-grained correspondence among the latents and large overlaps between the attention windows. Such progressive video denoising enables our models to autoregressively generate video frames without temporal inconsistency or quality degradation over time. We present the first results on text-conditioned 60-second (1440 frames) long video generation at a quality close to that of frontier models.
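To make the key idea concrete, below is a minimal Python sketch of autoregressive denoising with progressive per-frame noise levels: a sliding window holds latent frames whose noise levels increase monotonically from head to tail; each shared denoising pass advances every frame by one level, after which the clean head frame is emitted and a fresh pure-noise frame is appended. The `denoise` callable and all names here are illustrative assumptions, not the paper's actual API.

```python
import torch

def generate_long_video(denoise, num_frames, window=16, latent_shape=(4, 32, 32)):
    """Autoregressive video denoising with progressive noise levels (sketch).

    `denoise(latents, noise_levels)` is a hypothetical one-step denoiser that
    reduces each frame's noise by one level; it stands in for the model.
    """
    # Staggered noise levels: the oldest frame is nearly clean (1/window),
    # the newest is pure noise (1.0). Levels are fixed per window position.
    noise_levels = torch.linspace(1.0 / window, 1.0, window)
    latents = torch.randn(window, *latent_shape)  # initial window of noise
    clean_frames = []

    while len(clean_frames) < num_frames:
        # One shared pass over the whole window; overlapping attention
        # windows see frames at fine-grained, progressively varied levels.
        latents = denoise(latents, noise_levels)

        # The head frame has now reached noise level 0: emit it...
        clean_frames.append(latents[0])
        # ...then slide the window: drop the head, append fresh noise at the
        # tail, so each position keeps its assigned noise level.
        latents = torch.cat([latents[1:], torch.randn(1, *latent_shape)], dim=0)

    return torch.stack(clean_frames[:num_frames])
```

Because the window slides one frame at a time, consecutive denoising passes overlap on all but one frame, which is what gives the large attention-window overlap the abstract credits for temporal consistency.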
Supplementary Material: zip
Submission Number: 28