AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks

Published: 18 Nov 2024, Last Modified: 18 Nov 2024. Accepted by TMLR. License: CC BY 4.0
Abstract: In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended image-based generative models in a zero-shot manner or necessitated extensive fine-tuning, which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as editing guidance, leading to ambiguities and limiting the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm that simplifies video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, and (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tool to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V also supports videos of arbitrary length. Our evaluation shows that AnyV2V achieved CLIP scores comparable to other baseline methods. Furthermore, AnyV2V significantly outperformed these baselines in human evaluations, demonstrating notable improvements in visual consistency with the source video while producing high-quality edits across all editing tasks.
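To illustrate the two-step paradigm described in the abstract, the sketch below uses hypothetical placeholder functions (edit_first_frame, generate_with_injection, anyv2v_edit) that stand in for an off-the-shelf image editor and an image-to-video backbone with feature injection; it is a minimal sketch of the workflow, not the actual implementation from the linked repository.

```python
# Minimal sketch of the AnyV2V two-step paradigm (illustrative only).
# All helpers below are hypothetical placeholders, not the real repo API.

from typing import Any, List


def edit_first_frame(frame: Any, instruction: str) -> Any:
    """Step 1: apply any off-the-shelf image editing model to the first frame.
    This could be prompt-based editing, reference-based style transfer,
    subject-driven editing, or identity manipulation."""
    raise NotImplementedError("plug in your preferred image editing model here")


def generate_with_injection(edited_first_frame: Any,
                            source_frames: List[Any],
                            prompt: str) -> List[Any]:
    """Step 2: run an image-to-video generation model conditioned on the edited
    first frame, injecting temporal features derived from the source video so
    that the output preserves the original motion and layout."""
    raise NotImplementedError("plug in your image-to-video backbone here")


def anyv2v_edit(source_frames: List[Any],
                instruction: str,
                prompt: str) -> List[Any]:
    """Tuning-free video editing: edit frame 0, then propagate the edit."""
    edited_first = edit_first_frame(source_frames[0], instruction)
    return generate_with_injection(edited_first, source_frames, prompt)
```

Because the image editor in step 1 is fully decoupled from the video generator in step 2, any image editing tool can be swapped in without retraining, which is what enables the range of editing tasks listed above.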
Certifications: Reproducibility Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/TIGER-AI-Lab/AnyV2V
Supplementary Material: zip
Assigned Action Editor: ~Yizhe_Zhang2
Submission Number: 3123