Keywords: Video Diffusion, Video Motion Editing
Abstract: Text-driven image and video diffusion models have recently achieved unprecedented generation realism. While diffusion models have been successfully applied to image editing, none of them can edit motion in video. We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general, real-world videos. Our approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high-resolution information that it synthesizes to align with the guiding text prompt. As maintaining high fidelity to the original video requires retaining some of its high-resolution information, we add a preliminary stage of finetuning the model on the original video, significantly boosting fidelity. We propose to improve motion editability by using a mixed objective that jointly finetunes with full temporal attention and with temporal attention masking. We extend our method to animate images, bringing them to life by adding motion to existing or new objects and by adding camera movements. Extensive experiments showcase our method's remarkable ability to edit motion in videos.
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1227