Keywords: Video-to-Video Editing, Test-Time Training, Domain Adaptation
Abstract: In the realm of generative AI, state-of-the-art Video-to-Video (V2V) editing models can perform diverse edits under different conditions and generate new videos. Despite this versatility, these models still suffer from significant frame inconsistencies, such as motion discrepancies and unnatural background changes. This paper addresses these issues by analyzing video inconsistencies through the lens of domain shift and introducing domain control based on this analysis. Furthermore, we propose a compute-optimal test-time sampling method that better represents different video domains, realized as a high-performance test-time training (TTT) procedure. Building on this TTT method, we propose T3V2V (TTT-V2V editing). Our method leverages frame-level information to construct an unsupervised TTT objective, providing more precise guidance for the image-to-video (I2V) generation process and improving video consistency through self-supervised parameter optimization and domain adaptation. Extensive experiments on the DAVIS-EDIT benchmark show that T3V2V outperforms previous state-of-the-art models. The self-supervised nature of our TTT approach further enables robust generalization to diverse V2V editing tasks, establishing a new paradigm for V2V synthesis.
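To make the general idea concrete, below is a minimal, hedged sketch of a per-video test-time training loop of the kind the abstract describes: a small set of parameters is adapted with a self-supervised frame-consistency objective before the adapted features are used to guide generation. This is not the authors' implementation; the module names, the adapter design, and the specific loss are illustrative assumptions.

```python
# Minimal TTT sketch (illustrative only, not the T3V2V implementation).
# FrameAdapter, temporal_consistency_loss, and test_time_adapt are hypothetical names.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameAdapter(nn.Module):
    """Hypothetical lightweight adapter whose parameters are updated at test time."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual refinement of per-frame features.
        return x + self.proj(x)


def temporal_consistency_loss(features: torch.Tensor) -> torch.Tensor:
    """Self-supervised loss: penalize feature differences between adjacent frames.

    features: (T, C, H, W) per-frame feature maps of one video.
    """
    return F.mse_loss(features[1:], features[:-1])


def test_time_adapt(adapter: nn.Module,
                    frame_features: torch.Tensor,
                    steps: int = 10,
                    lr: float = 1e-4) -> nn.Module:
    """Run a few gradient steps on the self-supervised objective for this video."""
    optimizer = torch.optim.Adam(adapter.parameters(), lr=lr)
    adapter.train()
    for _ in range(steps):
        optimizer.zero_grad()
        refined = adapter(frame_features)          # (T, C, H, W)
        loss = temporal_consistency_loss(refined)  # unsupervised, per-video signal
        loss.backward()
        optimizer.step()
    adapter.eval()
    return adapter


if __name__ == "__main__":
    # Dummy per-frame features standing in for activations of a real V2V backbone.
    feats = torch.randn(8, 64, 32, 32)
    adapter = test_time_adapt(FrameAdapter(channels=64), feats)
    with torch.no_grad():
        guided = adapter(feats)  # adapted features would then guide I2V generation
    print(guided.shape)
```

The design choice this sketch illustrates is that adaptation happens per input video and requires no labels: only a frame-level self-supervised signal drives the parameter update before generation.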
Submission Number: 12