Keywords: Text-to-Video Generation, Diffusion Models, Diffusion Guidance, Zero-shot Image-to-Video Generation
TL;DR: We present two new sampling methods for Text-to-Video (T2V) diffusion models that enhance pre-trained models, enabling dynamic scene generation as well as zero-shot image-to-video and image-image-to-video generation (conditioned on the first and last frames).
Abstract: Current text-to-video (T2V) models have made significant progress in generating high-quality video. However, these models are limited when it comes to generating dynamic video scenes, where the per-frame description can vary dramatically. Changing the color, shape, position, and state of objects in the scene is a challenge that current video models cannot handle. In addition, the lack of a cheap image-based conditioning mechanism limits their creative application. To address these challenges and extend the applicability of T2V models, we propose two innovative approaches: **State Guidance** and **Image Guidance**. **State Guidance** uses advanced guidance mechanisms to control motion dynamics and the smoothness of scene transformations by navigating the diffusion process between a state triplet <initial state, transition state, final state>. This mechanism enables the generation of dynamic video scenes (Dynamic Scene T2V) and allows control over the speed and expressiveness of the scene transformation by introducing temporal dynamics via a guidance weight schedule across video frames. **Image Guidance** enables Zero-Shot Image-to-Video generation (Zero-Shot I2V) by injecting a reference image into the noise predictions of the initial diffusion steps. Furthermore, combining **State Guidance** and **Image Guidance** allows for zero-shot transitions between two input reference frames of a video (Zero-Shot II2V). Finally, we introduce the novel **Dynamic Scene Benchmark** to evaluate the ability of models to generate dynamic video scenes. Extensive experiments show that **State Guidance** and **Image Guidance** successfully address the aforementioned challenges and significantly improve the generation capabilities of existing T2V architectures.
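To make the two mechanisms concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of how they could be realized around a generic epsilon-prediction T2V diffusion model, as described in the abstract: per-frame guidance weights mix the noise predictions conditioned on the <initial, transition, final> state prompts, and a reference image is blended into the noise prediction during the initial (high-noise) sampling steps. All names (`predict_eps`, `triplet_schedule`, thresholds, weight shapes) are illustrative assumptions.

```python
# Hypothetical sketch of State Guidance and Image Guidance, assuming a generic
# epsilon-prediction T2V diffusion model with an assumed interface
# model.predict_eps(x_t, t, text_embedding) -> (B, F, C, H, W).
import torch


def triplet_schedule(num_frames):
    """Example guidance weight schedule across frames: initial -> transition -> final.

    Returns a (F, 3) tensor whose rows sum to 1; how quickly the weights shift
    across frames controls the speed/expressiveness of the scene transformation.
    """
    s = torch.linspace(0.0, 1.0, num_frames)
    w_init = (1.0 - 2.0 * s).clamp(min=0.0)      # fades out over the first half
    w_trans = 1.0 - (2.0 * s - 1.0).abs()        # peaks at the middle of the video
    w_final = (2.0 * s - 1.0).clamp(min=0.0)     # fades in over the second half
    return torch.stack([w_init, w_trans, w_final], dim=1)


def state_guidance_eps(model, x_t, t, states, frame_weights):
    """Combine per-state noise predictions with frame-dependent guidance weights.

    states        : dict of text embeddings for 'initial', 'transition', 'final'
    frame_weights : (F, 3) tensor, e.g. from triplet_schedule().
    """
    eps_states = torch.stack(
        [model.predict_eps(x_t, t, states[k]) for k in ("initial", "transition", "final")],
        dim=0,
    )                                             # (3, B, F, C, H, W)
    w = frame_weights.T[:, None, :, None, None, None]  # (3, 1, F, 1, 1, 1)
    return (w * eps_states).sum(dim=0)            # weighted mix per frame


def image_guidance_step(x_t, eps, ref_image, alpha_bar_t, t, inject_until=800, strength=0.5):
    """Inject a reference image into the noise prediction during early steps.

    eps_ref is the noise that would explain x_t if the clean sample were ref_image,
    i.e. eps_ref = (x_t - sqrt(alpha_bar_t) * ref_image) / sqrt(1 - alpha_bar_t);
    blending it into eps pulls the denoising trajectory toward the reference image.
    Only applied while t >= inject_until (the initial, high-noise sampling steps).
    """
    if t < inject_until:
        return eps
    eps_ref = (x_t - (alpha_bar_t ** 0.5) * ref_image) / ((1.0 - alpha_bar_t) ** 0.5)
    return (1.0 - strength) * eps + strength * eps_ref
```

In this sketch, Dynamic Scene T2V corresponds to running the sampler with `state_guidance_eps`, Zero-Shot I2V to additionally calling `image_guidance_step` with one reference image, and Zero-Shot II2V to applying the image injection per frame with the first and last reference frames; the paper's actual formulation may differ in its guidance combination and schedules.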
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7611