UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control

TMLR Paper 3365 Authors

20 Sept 2024 (modified: 10 Nov 2024) · Decision pending for TMLR · CC BY 4.0
Abstract: Video Diffusion Models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite this progress, ensuring consistency across frames remains a challenge, particularly when text prompts are used as the control condition. To address this problem, we introduce UniCtrl, a novel, plug-and-play method that is universally applicable for improving the spatiotemporal consistency and motion diversity of videos generated by text-to-video models without additional training. UniCtrl enforces semantic consistency across frames through cross-frame self-attention control, while enhancing motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results demonstrate UniCtrl's efficacy in enhancing various text-to-video models, confirming its effectiveness and universality.
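To make the cross-frame self-attention idea in the abstract concrete, the sketch below shows one common way such control is implemented in video diffusion UNets: each frame keeps its own queries, but keys and values are taken from an anchor (first) frame so appearance stays consistent across frames. This is a minimal, hypothetical illustration of the general technique, not the authors' exact implementation; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_frame_self_attention(q, k, v, num_frames):
    """Hypothetical sketch of cross-frame self-attention control.

    q, k, v: (batch * num_frames, num_tokens, dim) projections from a
    spatial self-attention layer of a video diffusion UNet. Each frame keeps
    its own queries, while keys and values are replaced by those of the first
    (anchor) frame to keep appearance consistent across frames.
    """
    bf, tokens, dim = k.shape
    batch = bf // num_frames

    # Make the frame axis explicit: (batch, num_frames, tokens, dim).
    k = k.view(batch, num_frames, tokens, dim)
    v = v.view(batch, num_frames, tokens, dim)

    # Broadcast the anchor frame's keys/values to every frame.
    k_anchor = k[:, :1].expand(-1, num_frames, -1, -1).reshape(bf, tokens, dim)
    v_anchor = v[:, :1].expand(-1, num_frames, -1, -1).reshape(bf, tokens, dim)

    # Standard scaled dot-product attention with the shared keys/values.
    return F.scaled_dot_product_attention(q, k_anchor, v_anchor)
```

In practice, a hook of this kind would be patched into the pretrained model's self-attention layers at inference time, which is what makes the approach training-free and plug-and-play.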
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Following the suggestions from the AE and reviewers, we replaced examples that appeared static with more dynamic ones to better showcase the results.
Code: https://github.com/XuweiyiChen/UniCtrl
Assigned Action Editor: ~Ran_He1
Submission Number: 3365