Keywords: text-to-video, motion transfer, query features, self attention, motion prior, structural video prior
TL;DR: We find that Q-features in video diffusion models encode both structure and identity - unlike image models, where they encode structure only - creating unexpected trade-offs. Using Q features from source videos as priors enables zero-shot motion transfer and multi-shot consistency.
Abstract: Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query (Q) features simultaneously govern motion, structure, and identity, revealing these features as key structural priors that control video generation. Our analysis shows that Q affects not only layout; during denoising, Q also strongly influences subject identity, making it hard to transfer motion without the side effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method - implemented with VideoCrafter2 and WAN 2.1 - that is 10x more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.
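To make the core idea concrete, here is a minimal toy sketch of query (Q) injection in self-attention: queries recorded from a "source" pass replace the queries of a "target" pass, so the source's structural prior steers the target's attention. This is an illustrative single-head simplification, not the paper's actual implementation (VideoCrafter2 and WAN 2.1 use multi-head spatio-temporal attention inside a full denoising loop); all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v, q_override=None):
    """Toy single-head self-attention with optional Q injection.

    x: (tokens, dim). When q_override is given, queries recorded from
    a source pass replace the queries computed from x (Q injection),
    while keys and values still come from the target features x.
    """
    q = x @ w_q if q_override is None else q_override
    k, v = x @ w_k, x @ w_v
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v, q

torch.manual_seed(0)
tokens, dim = 4, 8
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))

# "Source video" pass: record the Q features that carry structure/motion.
x_src = torch.randn(tokens, dim)
_, q_src = self_attention(x_src, w_q, w_k, w_v)

# "Target" pass: inject the recorded queries into a new generation.
x_tgt = torch.randn(tokens, dim)
out_injected, _ = self_attention(x_tgt, w_q, w_k, w_v, q_override=q_src)
out_plain, _ = self_attention(x_tgt, w_q, w_k, w_v)
```

The trade-off the abstract describes corresponds to choosing at which denoising steps and layers `q_override` is applied: injecting everywhere transfers identity along with motion, so the injection schedule must be controlled.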
Supplementary Material: zip
Submission Number: 12