Comp-Attn: Present-and-Align Attention for Compositional Video Generation

Hongyu Zhang; Yufan Deng; Shenghai Yuan; Xuehan Hou; Yian Zhao; Peng Jin; Chang Liu; Jie Chen

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. Venue: ICML 2026 regular. License: CC BY-SA 4.0

Abstract: In text-to-video (T2V) generation, reliably synthesizing compositional content that involves multiple subjects with intricate relations remains underexplored. The main challenges are twofold: 1) Subject presence, where not all subjects appear in the video; 2) Inter-subject relations, where the interactions and spatial relationships between subjects are misaligned with the prompt. Existing methods rely on techniques such as inference-time latent optimization or layout control, which fail to address both issues simultaneously. To tackle these problems, we propose Comp-Attn, a composition-aware cross-attention variant that follows a "Present-and-Align" paradigm: it decouples the two challenges by enforcing subject presence at the condition level and achieving relational alignment at the attention-distribution level. Specifically, 1) we introduce Subject-aware Condition Interpolation (SCI) to reinforce subject-specific conditions and ensure each subject's presence; 2) we propose Layout-forcing Attention Modulation (LAM), which dynamically steers the attention distribution to align with the relational layout of multiple subjects. Comp-Attn can be seamlessly integrated into various T2V baselines in a training-free manner, boosting T2V-CompBench scores by 15.7% and 11.7% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B, respectively, with only a 5% increase in inference time. It also achieves strong performance on VBench and T2I-CompBench, demonstrating that it extends to general T2V and compositional text-to-image (T2I) tasks. Code and models are available at: https://github.com/Hong-yu-Zhang/Comp-attn.
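The abstract names the two components but does not give their formulas, so the following is only an illustrative sketch under assumed forms, not the authors' implementation. It assumes that condition-level reinforcement can be pictured as a linear blend between a prompt-token embedding and a subject-only embedding, and that layout forcing can be pictured as an additive bias on cross-attention logits. The function names, the `alpha` and `bias` parameters, and the region format are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate_condition(prompt_emb, subject_emb, alpha):
    """Assumed SCI-style step: blend a prompt-token embedding toward a
    subject-only embedding. alpha=0 keeps the prompt embedding unchanged."""
    return [(1 - alpha) * p + alpha * s for p, s in zip(prompt_emb, subject_emb)]

def layout_biased_attention(queries, keys, token_regions, bias):
    """Assumed LAM-style step: cross-attention weights with a layout bias.

    queries: one vector per spatial position.
    keys: one vector per text token.
    token_regions: maps a subject token index to the set of position
        indices where that subject is planned to appear.
    bias: added to the logit when the position lies inside the token's
        region, subtracted when it lies outside. Tokens without a region
        are left unmodified.
    """
    scale = math.sqrt(len(keys[0]))
    weights = []
    for pos, q in enumerate(queries):
        logits = []
        for tok, k in enumerate(keys):
            logit = sum(a * b for a, b in zip(q, k)) / scale
            if tok in token_regions:
                logit += bias if pos in token_regions[tok] else -bias
            logits.append(logit)
        weights.append(softmax(logits))
    return weights

# Toy setup: 4 spatial positions, 3 text tokens, two of them subjects.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[0.0, 0.0] for _ in range(4)]
regions = {0: {0, 1}, 1: {2, 3}}  # subject 0 on the left, subject 1 on the right
attn = layout_biased_attention(queries, keys, regions, bias=2.0)
blended = interpolate_condition([0.0, 2.0], [2.0, 0.0], alpha=0.25)
```

In this toy setup, positions inside a subject's planned region give that subject's token more attention weight than the other subject's token, which is the intuition behind aligning attention with a layout. Both steps only change inputs to or logits of an existing attention call, which is consistent with the training-free integration the abstract describes.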

Lay Summary: Today’s text-to-video systems can create impressive videos from written prompts, but they often struggle when a scene contains several objects or people that must appear in the right places and interact correctly. For example, a model may forget one subject, place objects on the wrong side, or confuse which action belongs to which subject. We introduce Comp-Attn, a simple add-on that helps video generation models better follow complex prompts. Our idea is to split the problem into two parts: first make sure every mentioned subject appears, and then make sure the subjects are arranged and related correctly. To do this, Comp-Attn strengthens the model’s understanding of each individual subject and gently guides the model’s attention toward a planned layout of the scene. Importantly, Comp-Attn does not require retraining the video model. It can be plugged into several existing video generation systems and improves their ability to create multi-subject scenes, while adding only a small amount of extra generation time.

Primary Area: Applications->Computer Vision

Keywords: video generation

Submission Number: 489
