TL;DR: History Guidance guides video diffusion models with any set of context frames, significantly enhancing video quality.
Abstract: Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Project website: [https://boyuan.space/history-guidance](https://boyuan.space/history-guidance)
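For intuition, the sketch below shows how a CFG-style guidance rule could be applied to history conditioning, i.e., combining a history-conditioned and a history-free denoising prediction. It is a minimal illustration assuming a generic PyTorch denoiser interface; the function and argument names (`model`, `history`, `guidance_scale`) are placeholders, not the released DFoT API.

```python
import torch

def vanilla_history_guidance(model, noisy_frames, history, t, guidance_scale=1.5):
    """Sketch of CFG-style guidance on history conditioning (illustrative only)."""
    # Conditional prediction: denoise the future frames given the clean history frames.
    eps_cond = model(noisy_frames, history=history, timestep=t)
    # Unconditional prediction: the same frames with the history dropped.
    eps_uncond = model(noisy_frames, history=None, timestep=t)
    # Classifier-free-guidance combination: extrapolate toward the
    # history-conditioned prediction by the guidance scale.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```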
Lay Summary: Creating high-quality, long, and realistic videos with AI is an exciting area of research, but current AI models often fall short. They typically generate only short videos and struggle to keep objects and scenes consistent over time.
Our paper tackles this challenge by introducing a new approach that enables AI models to better use information from any point in a video’s past, known as its “history.” This improved handling of history brings two main benefits. First, by more effectively remembering past content, our model produces videos that are more realistic, dynamic, and consistent. Second, by continually connecting past frames to newly generated ones, our method can create extremely long videos—something that was not possible with previous techniques.
We have open-sourced our methods, the Diffusion Forcing Transformer and History Guidance, making it easy for others to apply them to larger AI models. We hope this will help unlock new applications in areas such as media production and robotics, while also advancing the capabilities of AI video generation.
Link To Code: https://github.com/kwsong0113/diffusion-forcing-transformer
Primary Area: Applications->Computer Vision
Keywords: diffusion, video, guidance, generative models, 3d
Submission Number: 9254