Keywords: Generative Model, Video Diffusion, Training-Free Video Editing
Abstract: Existing text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and errors under cross-domain transformations. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing technique that produces visually appealing and temporally coherent videos through latent optimization guided by our novel STR score. The proposed score captures spatiotemporal pixel relevance across adjacent frames by leveraging the 2D spatial attention and 1D temporal attention maps in text-to-video (T2V) diffusion models, without the overhead of computationally expensive full 3D attention. Integrated into a latent optimization framework with a latent mask, STR-Match generates high-fidelity videos with strong spatiotemporal consistency, preserving key visual attributes of the source video while remaining robust under significant domain shifts. Our extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatiotemporal consistency.
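To make the factorized-attention idea in the abstract concrete, below is a minimal, hypothetical sketch of how a spatiotemporal relevance score could be assembled from per-frame 2D spatial attention maps and per-pixel 1D temporal attention maps, avoiding a full 3D attention computation. The function name `str_score` and the tensor layouts are our own assumptions for illustration; the paper's actual STR score may differ.

```python
import torch

def str_score(spatial_attn: torch.Tensor, temporal_attn: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a factorized spatiotemporal relevance score.

    Args:
        spatial_attn:  (F, N, N) per-frame 2D spatial self-attention maps,
                       where F is the number of frames and N = H * W latent pixels.
        temporal_attn: (N, F, F) per-pixel 1D temporal attention maps.

    Returns:
        (F - 1, N, N) tensor scoring relevance between pixels of adjacent
        frames, approximating full 3D attention at far lower cost.
    """
    num_frames, num_pixels, _ = spatial_attn.shape
    scores = []
    for f in range(num_frames - 1):
        # Per-pixel temporal weight linking frame f to frame f + 1.
        t_weight = temporal_attn[:, f, f + 1]                 # (N,)
        # Modulate frame f's spatial attention by the temporal link,
        # then chain it with frame f + 1's spatial attention.
        cross = spatial_attn[f] * t_weight.unsqueeze(0)       # (N, N)
        scores.append(cross @ spatial_attn[f + 1])            # (N, N)
    return torch.stack(scores)                                # (F - 1, N, N)
```

The key design point this sketch illustrates is cost: chaining 2D spatial maps through 1D temporal links scales with N^2 per frame pair, whereas full 3D attention over all frames and pixels would scale with (F * N)^2.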
Supplementary Material: zip
Primary Area: generative models
Submission Number: 22764