Keywords: video generation, real-time, interactive, diffusion model, adversarial, GAN
TL;DR: Real-time interactive streaming video generation. 24fps up to a minute long.
Abstract: Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to turn a pre-trained latent video diffusion model into a real-time, interactive, streaming video generator. Our model autoregressively generates one latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as control to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This allows us to design a more efficient architecture for one-step generation and to train the model in a student-forcing manner to mitigate error accumulation. The adversarial approach also enables us to train the model for long-duration generation while fully utilizing the KV cache. As a result, our 8B model achieves real-time, 24fps, nonstop, streaming video generation at 736x416 resolution on a single H100, or at 1280x720 on 8xH100, up to a minute long (1440 frames).
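The streaming loop described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: `toy_generator`, `ToyCache`, and `stream` are hypothetical names, frames are reduced to single floats, and each step is treated as one output frame (any temporal compression in the latent space is ignored). It only shows the control flow: one generator call per step, the model's own previous output fed back as input (as in student forcing), a cache that grows with each step, and a per-step control signal.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

FPS = 24             # frame rate stated in the abstract
DURATION_SEC = 60    # "up to a minute long"


@dataclass
class ToyCache:
    """Hypothetical stand-in for a KV cache: remembers past outputs."""
    entries: List[float] = field(default_factory=list)


def toy_generator(prev_frame: float, control: float, cache: ToyCache) -> float:
    """Hypothetical one-step generator: a single call yields the next frame."""
    # Summarize cached history instead of recomputing it from scratch.
    context = sum(cache.entries) / len(cache.entries) if cache.entries else 0.0
    new_frame = 0.5 * prev_frame + 0.25 * context + control
    cache.entries.append(new_frame)
    return new_frame


def stream(num_frames: int,
           get_control: Callable[[int], float],
           first_frame: float = 0.0) -> Tuple[List[float], int]:
    """Generate frames one at a time, feeding each output back as input."""
    cache = ToyCache()
    frame = first_frame
    calls = 0
    frames: List[float] = []
    for t in range(num_frames):
        control = get_control(t)                       # interactive input for this step
        frame = toy_generator(frame, control, cache)   # exactly one evaluation per step
        calls += 1
        frames.append(frame)                           # would be streamed to the user here
    return frames, calls


# One minute at 24 fps, with no control input.
frames, calls = stream(FPS * DURATION_SEC, lambda t: 0.0)
print(len(frames), calls)
```

At 24 fps, one minute corresponds to 24 × 60 = 1440 frames, matching the figure quoted in the abstract; in this toy loop that is also the number of generator calls.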
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 9607