Video Diffusion Models: A Survey

Published: 15 Nov 2024, Last Modified: 15 Nov 2024. Accepted by TMLR. License: CC BY 4.0
Abstract: Diffusion generative models have recently become a powerful technique for creating and modifying high-quality, coherent video content. This survey provides a comprehensive overview of the critical components of diffusion models for video generation, including their applications, architectural design, and temporal dynamics modeling. The paper begins by discussing the core principles and mathematical formulations, then explores various architectural choices and methods for maintaining temporal consistency. A taxonomy of applications is presented, categorizing models based on input modalities such as text prompts, images, videos, and audio signals. Advancements in text-to-video generation are discussed to illustrate the state-of-the-art capabilities and limitations of current approaches. Additionally, the survey summarizes recent developments in training and evaluation practices, including the use of diverse video and image datasets and the adoption of various evaluation metrics to assess model performance. The survey concludes with an examination of ongoing challenges, such as generating longer videos and managing computational costs, and offers insights into potential future directions for the field. By consolidating the latest research and developments, this survey aims to serve as a valuable resource for researchers and practitioners working with video diffusion models. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=sgDFqNTdaN
Changes Since Last Submission:
> The organization of the paper was criticized, with a lack of references in the first half [Pz4A]. We added citations throughout the Introduction and Background sections.
> An introduction to diffusion models was desired [Pz4A]. We added background in Section 3.
> Poor categorization of methods in the literature review was criticized [KyiL]. We reorganized the section and provided an improved taxonomy. The full taxonomy of methods is provided in Section 2.
> The scope of the paper was criticized. Discussion of video completion, unconditional video synthesis, and the relation between diffusion and non-diffusion based approaches was desired [Pz4A]. Similarly, video editing [KyiL] and understanding tasks [KyiL,Pz4A] were also desired. We added discussion of video completion (Sec. 8), video editing (Sec. 10), unconditional synthesis (Sec. 6), and non-diffusion approaches (Sec. 8.2).
> Discussion of hybrid auto-regressive/diffusion approaches was desired [Pz4A]. We added details in Section 8.2.
> Discussion of the linear output projection in self-attention was desired [Pz4A]. We added a discussion of it and clarified related statements in Section 4.2.
> Additional detail on issues with training from scratch and inconsistent alignments was desired [B2ge]. We added a discussion of augmenting with image datasets in Sec. 6.2.
> Additional references on temporal dynamics and visual examples of model failures were desired [B2ge]. We added references, and visual examples of model failures were added in Figure 5.
> Discussion of training data was desired [Pz4A,B2ge], and a claim about the lack of availability of labeled video data was criticized [Pz4A]. We added a section on data and modified the statement. The training data section was reworked in Sec. 6.
> The lack of quantitative analysis and comparisons of different models on benchmarks was criticized [KyiL,Pz4A,B2ge]. We added an overview of FVD benchmark comparisons. A benchmarks and quantitative analysis section was added in Sec. 6.3.
> Additional detailed conclusions and insights were desired [B2ge]. We added material on data. An outlook was added in Sections 12 and 13.
> In-depth conclusions about quantitative metrics and their comparison on relevant data sets were desired [B2ge]. We refer to the added material on benchmarks.
> Additional details for the evaluation section were desired [KyiL]. We added a benchmarks and quantitative analysis section in Sec. 6.3.
Assigned Action Editor: ~Yingnian_Wu1
Submission Number: 2856