Video Diffusion Models: A Survey

TMLR Paper2856 Authors

12 Jun 2024 (modified: 13 Jun 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: Diffusion generative models have recently become a robust technique for producing and modifying coherent, high-quality video. This survey offers a systematic overview of critical elements of diffusion models for video generation, covering applications, architectural choices, and the modeling of temporal dynamics. Recent advancements in the field are summarized and grouped into development trends. The survey concludes with an overview of remaining challenges and an outlook on the future of the field.
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=sgDFqNTdaN
Changes Since Last Submission:
> Organization of the paper was criticized, with a lack of references in the first half [Pz4A]. We added citations throughout the Introduction and Background sections.
> An introduction to diffusion models was desired [Pz4A]. We added background in Section 3.
> Poor categorization of methods in the literature review was criticized [KyiL]. We reorganized the section and provided an improved taxonomy; the full taxonomy of methods is given in Section 2.
> The scope of the paper was criticized: discussion of video completion, unconditional video synthesis, and the relation of diffusion-based and non-diffusion-based approaches was desired [Pz4A], as were video editing [KyiL] and video understanding tasks [KyiL, Pz4A]. We added discussion of video completion (Sec. 8), video editing (Sec. 10), unconditional synthesis (Sec. 6), and non-diffusion approaches (Sec. 8.2).
> Discussion of hybrid autoregressive/diffusion approaches was desired [Pz4A]. We added details in Section 8.2.
> Discussion of the linear output projection in self-attention was desired [Pz4A]. We added this discussion and clarified related statements in Section 4.2.
> Additional detail on issues with training from scratch and inconsistent alignments was desired [B2ge]. We added discussion of augmenting with image datasets in Sec. 6.2.
> Additional references on temporal dynamics and visual examples of model failures were desired [B2ge]. We added references; visual examples of model failures are shown in Figure 5.
> Discussion of training data was desired [Pz4A, B2ge], and a claim about the lack of availability of labeled video data was criticized [Pz4A]. We added a section on data and modified the statement; the training data section was reworked in Sec. 6.
> A lack of quantitative analysis and comparisons of different models on benchmarks was criticized [KyiL, Pz4A, B2ge]. We added an overview of FVD benchmark comparisons; a benchmarks and quantitative analysis section was added in Sec. 6.3.
> Additional detailed conclusions and insights were desired [B2ge]. We added material on data; an outlook was added in Sections 12 and 13.
> In-depth conclusions about quantitative metrics and comparisons on relevant datasets were desired [B2ge]. We refer to the added material on benchmarks (Sec. 6.3).
> Additional details for the evaluation section were desired [KyiL]. We added a benchmarks and quantitative analysis section in Sec. 6.3.
Assigned Action Editor: ~Yingnian_Wu1
Submission Number: 2856