Keywords: Multi-Subject Video Generation, Reinforcement Learning for Generation, Semantic Understanding for Generation
Abstract: Video generative models pretrained on large-scale datasets can produce high-quality videos, but they are typically conditioned only on text or a single image, which limits their controllability and applicability.
We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency.
To faithfully preserve subject consistency and textual information in the synthesized videos, ID-Composer introduces a **hierarchical identity-preserving attention mechanism** that effectively aggregates features within and across subjects and modalities.
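As a rough illustration only (not the authors' released implementation), the two-stage aggregation could be sketched in PyTorch as below; the module name, token shapes, and head count are assumptions made for exposition:

```python
import torch
import torch.nn as nn

class HierarchicalIDAttention(nn.Module):
    """Sketch of a two-stage attention: intra-subject aggregation,
    then cross-attention from video tokens to all subjects and text."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.intra_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_subject = nn.LayerNorm(dim)
        self.norm_video = nn.LayerNorm(dim)

    def forward(self, video_tokens, subject_tokens, text_tokens):
        # video_tokens:   (B, N_v, D)     latent video tokens
        # subject_tokens: (B, S, N_s, D)  tokens per reference subject
        # text_tokens:    (B, N_t, D)     prompt tokens
        B, S, N_s, D = subject_tokens.shape

        # Stage 1: aggregate features within each subject independently.
        flat = subject_tokens.reshape(B * S, N_s, D)
        intra, _ = self.intra_attn(flat, flat, flat)
        intra = self.norm_subject(intra).reshape(B, S * N_s, D)

        # Stage 2: video tokens attend jointly across subjects and the text prompt.
        context = torch.cat([intra, text_tokens], dim=1)
        out, _ = self.cross_attn(self.norm_video(video_tokens), context, context)
        return video_tokens + out
```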
To ensure the generated videos follow user intent, we introduce
**semantic understanding via a pretrained vision-language model (VLM)**, leveraging the VLM's superior semantic reasoning to provide fine-grained guidance and to capture complex interactions among multiple subjects.
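One plausible way to wire such guidance in, shown purely as an assumed sketch rather than the paper's interface, is to project frozen VLM token features into the generator's conditioning space and append them to the text conditioning:

```python
import torch
import torch.nn as nn

class VLMSemanticAdapter(nn.Module):
    """Sketch: map frozen VLM hidden states into the video model's
    conditioning space. Names and dimensions here are illustrative."""

    def __init__(self, vlm_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, vlm_hidden_states: torch.Tensor, text_cond: torch.Tensor) -> torch.Tensor:
        # vlm_hidden_states: (B, N_vlm, vlm_dim) token features from a frozen VLM
        #   that has jointly read the prompt and the reference images.
        # text_cond:         (B, N_t, cond_dim) original text conditioning tokens.
        semantic_tokens = self.proj(vlm_hidden_states)
        # Concatenate fine-grained VLM guidance with the text conditioning stream.
        return torch.cat([text_cond, semantic_tokens], dim=1)
```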
Considering that the standard diffusion loss often fails to align critical concepts such as subject identity,
we employ an **online reinforcement learning phase** that extends the training objective of ID-Composer to reinforcement learning with verifiable rewards (RLVR).
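A minimal sketch of such an online RLVR-style update, under assumptions of our own (the `sample_with_logprob` and `reward_fn` interfaces are hypothetical, and the group-normalized REINFORCE objective is one common choice, not necessarily the paper's exact recipe):

```python
import torch

def rlvr_step(model, optimizer, prompts, ref_images, reward_fn, group_size=4):
    """Sample a group of videos per prompt, score them with a verifiable reward
    (e.g., subject-ID similarity to the reference images), and reinforce
    high-reward samples via group-normalized advantages."""
    # Hypothetical interface: returns videos (B, G, ...) and per-sample
    # log-probabilities (B, G) of the sampling trajectories.
    videos, logprobs = model.sample_with_logprob(prompts, ref_images, num_samples=group_size)

    # Hypothetical verifiable reward, e.g., identity similarity: (B, G).
    rewards = reward_fn(videos, prompts, ref_images)

    # Group-relative advantages: compare each sample to its own group.
    advantages = (rewards - rewards.mean(dim=1, keepdim=True)) / (
        rewards.std(dim=1, keepdim=True) + 1e-6
    )

    # REINFORCE-style objective: push up the likelihood of high-reward samples.
    loss = -(advantages.detach() * logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```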
Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.
Code and training data will be released.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3057