Abstract: In text-to-video (T2V) generation, reliably synthesizing compositional content that involves multiple subjects with intricate relations remains underexplored. The main challenges are twofold: 1) subject presence, where some subjects fail to appear in the video; and 2) inter-subject relations, where the interactions and spatial relationships between subjects are misaligned with the prompt. Existing methods rely on techniques such as inference-time latent optimization or layout control, which fail to address both issues simultaneously. To tackle these problems, we propose Comp-Attn, a composition-aware cross-attention variant that follows a "Present-and-Align" paradigm: it decouples the two challenges by enforcing subject presence at the condition level and achieving relational alignment at the attention-distribution level. Specifically, 1) we introduce Subject-aware Condition Interpolation (SCI) to reinforce subject-specific conditions and ensure each subject's presence; and 2) we propose Layout-forcing Attention Modulation (LAM), which dynamically steers the attention distribution to align with the relational layout of multiple subjects. Comp-Attn can be seamlessly integrated into various T2V baselines in a training-free manner, boosting T2V-CompBench scores by 15.7% and 11.7% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B, respectively, with only a 5% increase in inference time. It also achieves strong performance on VBench and T2I-CompBench, demonstrating its scalability to general T2V and compositional text-to-image (T2I) tasks. Code and models are available at: https://github.com/Hong-yu-Zhang/Comp-attn.
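Illustrative sketch: The two ideas named in the abstract can be pictured with a toy NumPy example. This is not the authors' implementation, and the abstract does not specify how SCI or LAM are computed. The function names `interpolate_condition` and `layout_biased_attention`, the blending weight `alpha`, the additive bias `strength`, and the binary `layout_mask` are all assumptions introduced here for illustration. The sketch only shows the general pattern of (a) blending a subject-specific embedding into the text condition and (b) biasing cross-attention logits toward a planned layout.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interpolate_condition(prompt_emb, subject_emb, token_idx, alpha=0.5):
    """Hypothetical condition-level step: blend a subject-only embedding
    into the prompt tokens that mention that subject.

    prompt_emb: (T, d) text-token embeddings of the full prompt
    subject_emb: (d,) embedding of the subject encoded on its own (assumed)
    token_idx: list of token positions belonging to the subject
    """
    out = prompt_emb.copy()
    out[token_idx] = (1.0 - alpha) * prompt_emb[token_idx] + alpha * subject_emb
    return out

def layout_biased_attention(q, k, v, layout_mask, strength=2.0):
    """Hypothetical attention-level step: cross-attention whose logits get an
    additive bonus wherever a visual position lies inside a token's region.

    q: (N, d) visual queries; k, v: (T, d) text keys and values
    layout_mask: (N, T), 1 where position n falls in the region of token t
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + strength * layout_mask  # assumed form of modulation
    attn = softmax(logits, axis=-1)
    return attn @ v, attn

# Toy usage: 4 visual positions, 3 text tokens, 8-dim features.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
text = rng.normal(size=(3, 8))
subject = rng.normal(size=8)

cond = interpolate_condition(text, subject, token_idx=[1], alpha=0.5)
mask = np.zeros((4, 3))
mask[:2, 0] = 1.0  # left half of the frame is planned for token 0
mask[2:, 1] = 1.0  # right half of the frame is planned for token 1
out, attn = layout_biased_attention(q, cond, cond, mask)
```

An additive bias keeps each attention row a valid probability distribution while shifting mass toward the planned region, which is one simple way to nudge attention without retraining; the actual method may differ.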
Lay Summary: Today’s text-to-video systems can create impressive videos from written prompts, but they often struggle when a scene contains several objects or people that must appear in the right places and interact correctly. For example, a model may forget one subject, place objects on the wrong side, or confuse which action belongs to which subject.
We introduce Comp-Attn, a simple add-on that helps video generation models better follow complex prompts. Our idea is to split the problem into two parts: first make sure every mentioned subject appears, and then make sure the subjects are arranged and related correctly. To do this, Comp-Attn strengthens the model’s understanding of each individual subject and gently guides the model’s attention toward a planned layout of the scene.
Importantly, Comp-Attn does not require retraining the video model. It can be plugged into several existing video generation systems and improves their ability to create multi-subject scenes, while adding only a small amount of extra generation time.
Originally Submitted Supplementary Material: zip
Primary Area: Applications->Computer Vision
Keywords: video generation
Originally Submitted PDF: pdf
Submission Number: 489