ViDE: Tuning-Free Video Coherence via Temporal Attention Reweighting and Prompt Blending

16 Apr 2026 (modified: 23 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Despite substantial progress in long video generation, multi-prompt synthesis often suffers from inconsistencies arising from the training-inference gap introduced by length-extension techniques and from coarse prompt interpolation. To address these issues, we propose the Video Diffusion with hidden states Editing (ViDE) framework, which consists of two key components. The first is the Time-frequency based Temporal Attention Reweighting (TiTAR) algorithm, which exploits the relationship between frame inconsistencies and the diagonal elements of temporal attention. By reweighting attention scores via the Discrete Short-Time Fourier Transform (DSTFT), TiTAR effectively reduces frame inconsistencies, a capability further corroborated by a Fourier-based analysis. The second component, PromptBlend, reduces inconsistencies in multi-prompt settings through fine-grained prompt alignment and adaptive interpolation, enabling smooth semantic transitions. Extensive experiments demonstrate the effectiveness of ViDE, with consistent and significant improvements over multiple baselines.
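The abstract's core idea, reweighting the diagonal of temporal attention based on a short-time Fourier analysis, can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's implementation: the window size `win`, the damping factor `alpha`, the high-frequency-energy threshold, and the use of NumPy's FFT in place of the paper's DSTFT are all assumptions.

```python
import numpy as np

def titar_reweight(attn, win=4, alpha=0.5):
    """Hypothetical sketch of TiTAR-style temporal attention reweighting.

    attn: (T, T) temporal attention matrix whose rows sum to 1.
    The diagonal attn[i, i] is analysed with a windowed (short-time)
    Fourier transform; windows whose high-frequency energy is large
    (a proxy for frame inconsistency) have their diagonal entries
    damped by `alpha`, after which each row is renormalised.
    """
    diag = np.diag(attn).astype(float)
    T = len(diag)
    out = attn.astype(float).copy()
    for s in range(0, T - win + 1, win):
        seg = diag[s:s + win]
        # Windowed FFT of one diagonal segment (assumed stand-in for DSTFT).
        spec = np.abs(np.fft.rfft(seg * np.hanning(win)))
        hf = spec[1:].sum()           # energy outside the DC bin
        total = spec.sum() + 1e-8
        if hf / total > 0.5:          # assumed inconsistency threshold
            for i in range(s, s + win):
                out[i, i] *= alpha    # damp over-weighted self-attention
    out /= out.sum(axis=1, keepdims=True)  # restore row-stochasticity
    return out
```

A smoothly varying diagonal passes through nearly unchanged, while a rapidly oscillating one (large high-frequency energy) is damped, which matches the abstract's claimed link between diagonal attention and frame inconsistency.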
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Long_Chen8
Submission Number: 8459