The Blessing of Smooth Initialization for Video Diffusion Models

24 Sept 2024 (modified: 13 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Smoothing, Video Diffusion Models, Noise Initialization
TL;DR: We propose a training-free paradigm that optimizes the initial Gaussian noise by introducing a targeted semantic prior bias into the sampling process from a smoothing perspective.
Abstract: Extending the success of text-to-image (T2I) synthesis to text-to-video (T2V) synthesis is a promising direction for visual generative AI. Popular training-free sampling algorithms currently generate high-fidelity images within the Stable Diffusion family. However, when applied to video diffusion models (VDMs), these techniques yield limited diversity and quality due to low-quality data in video datasets. To mitigate this issue, we focus on inference and propose a training-free paradigm that optimizes the initial Gaussian noise by introducing a targeted semantic prior bias into the sampling process from a smoothing perspective. The paradigm significantly improves both the fidelity and semantic faithfulness of the synthesized videos. Guided by theoretical analysis using random smoothing and differential equations, our resulting method SmoothInit can be understood as approximately incorporating third-order derivatives into gradient descent, which contributes to better convergence in learning semantic information. A more efficient version, Fast-SmoothInit, achieves better experimental results by leveraging a momentum mechanism. Both SmoothInit and Fast-SmoothInit demonstrate promising empirical results across various benchmarks, including UCF-101/MSR-VTT-related FVD, Chronomagic-bench, and T2V-Compbench, setting a new standard for noise initialization in VDMs.
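The abstract's core idea, optimizing the initial noise through a smoothed objective with a momentum variant, can be sketched as follows. This is an illustrative reconstruction, not the paper's actual algorithm: the function names (`smoothed_grad`, `optimize_init_noise`) are hypothetical, and a toy quadratic loss stands in for the semantic-prior objective that a real VDM pipeline would supply.

```python
import numpy as np

def smoothed_grad(loss_fn, z, sigma=0.1, n_samples=16, rng=None):
    """Randomized-smoothing gradient estimate of loss_fn at z.

    Averages antithetic finite-difference estimates over Gaussian
    perturbations, approximating the gradient of the smoothed loss
    E_eps[loss_fn(z + sigma * eps)].
    """
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(z)
    for _ in range(n_samples):
        eps = rng.standard_normal(z.shape)
        # Antithetic pair (+eps, -eps) reduces estimator variance.
        diff = loss_fn(z + sigma * eps) - loss_fn(z - sigma * eps)
        grad += diff / (2 * sigma) * eps
    return grad / n_samples

def optimize_init_noise(loss_fn, z0, steps=50, lr=0.05, momentum=0.9, sigma=0.1):
    """Momentum gradient descent on the initial noise (Fast-SmoothInit-style sketch)."""
    z, v = z0.copy(), np.zeros_like(z0)
    rng = np.random.default_rng(0)
    for _ in range(steps):
        g = smoothed_grad(loss_fn, z, sigma=sigma, rng=rng)
        v = momentum * v - lr * g  # heavy-ball momentum update
        z = z + v
    return z

# Toy stand-in for a semantic-prior loss: distance to a target latent.
target = np.full(8, 0.5)
loss = lambda z: float(np.sum((z - target) ** 2))
z0 = np.random.default_rng(1).standard_normal(8)
z_opt = optimize_init_noise(loss, z0)
```

In a real setting, `loss_fn` would score a denoised sample against the text prompt, and `z0` would be the initial latent noise fed to the sampler; the smoothing average is what connects this scheme to the higher-order-derivative interpretation mentioned in the abstract.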
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3555