ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation

ICLR Submission #3057

ID-Composer is a multi-subject video synthesis model that
generates subject-consistent videos from a text prompt and multiple reference images.

🧩   Abstract   🧩

Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency.

ID-Composer has two key designs: (1) a hierarchical identity-preserving attention mechanism, which aggregates features within and across subjects and modalities, enabling identity consistency and textual faithfulness; and (2) semantic understanding via a pretrained vision-language model (VLM), whose stronger semantic understanding provides fine-grained guidance and captures complex interactions between multiple subjects. An online reinforcement learning phase is further introduced to enhance video quality and identity preservation. Extensive experiments demonstrate that ID-Composer surpasses existing methods in identity preservation, temporal consistency, and video quality. Code and data will be released.

🔮   Method   🔮

Architecture

ID-Composer generates multi-subject videos from a text prompt and multiple reference images using three components: (1) Hierarchical Identity-Preserving Attention, which aggregates features both within and across subjects and modalities, ensuring identity consistency and faithful textual alignment; (2) Semantic Guidance via Pretrained Vision-Language Models (VLMs), which leverages the VLMs' rich semantic understanding to capture fine-grained interactions among multiple subjects and modalities; and (3) an online reinforcement learning phase, which further enhances video quality and preserves subject identities across time. Extensive experiments show that ID-Composer outperforms previous methods in identity preservation, temporal consistency, and overall video quality.
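The idea of aggregating features "within and across subjects and modalities" can be illustrated with a small NumPy sketch. This page does not specify the actual layer design, so everything below is an assumption made for illustration: the three-level ordering, the function names (`attention`, `hierarchical_identity_attention`), the token shapes, and the omission of learned projections and multiple heads. It is a toy instance of hierarchical attention, not ID-Composer's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention.
    # q: (n, d); k and v: (m, d); returns (n, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def hierarchical_identity_attention(video_tokens, subject_tokens, text_tokens):
    # Hypothetical three-level aggregation (an assumption, not the paper's design).
    # video_tokens: (n_video, d); subject_tokens: list of (n_i, d); text_tokens: (n_text, d).

    # Level 1, within each subject: reference tokens of one subject attend to each other.
    refined = [attention(s, s, s) for s in subject_tokens]

    # Level 2, across subjects and modalities: each subject attends to all subjects and the text.
    context = np.concatenate(refined + [text_tokens], axis=0)
    fused = [attention(r, context, context) for r in refined]

    # Level 3: video tokens read from the fused subject tokens and the text (residual update).
    memory = np.concatenate(fused + [text_tokens], axis=0)
    return video_tokens + attention(video_tokens, memory, memory)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(16, 8))                               # 16 video tokens, width 8
    subjects = [rng.normal(size=(4, 8)), rng.normal(size=(6, 8))]  # two reference subjects
    text = rng.normal(size=(5, 8))                                 # 5 text tokens
    out = hierarchical_identity_attention(video, subjects, text)
    print(out.shape)
```

In this sketch each subject may contribute a different number of reference tokens, and the video tokens keep their shape, so such a block could in principle be stacked. A real model would add learned query, key, and value projections at every level.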

Dataset Curation

Statistics of the constructed dataset. The dataset is organized into four primary scenarios: Human, Objects, Environment, and Nature, each containing a variety of subcategories.

🧩   Results of ID-Composer   🧩

Columns: Input Image, Phantom-14B, VACE-14B, Kling 1.6, Ours
The video features a man with a rugged beard, wearing a leather jacket, riding a vintage motorcycle along a desert highway. His expression is focused, eyes narrowed slightly against the wind, as the setting sun casts a warm glow over the landscape. The highway stretches endlessly, bordered by arid land with occasional cacti and rocky outcrops. The motorcycle roars smoothly, leaving a light trail of dust. In the distance, hazy mountains are silhouetted against the amber sky. The scene suggests adventure and determination, evoking freedom, with the man riding purposefully through the tranquil, sunlit desert.
Columns: Input Image, Phantom-14B, VACE-14B, Kling 1.6, Ours
The video begins with a close-up view of a smartphone with a pink case resting on an open notebook, while a person's hand is seen typing on a laptop keyboard in the background. The scene is set on a light-colored desk, and the focus is on the smartphone and the person's hand. As the video progresses, the smartphone's screen lights up, displaying a notification with a red badge and a message. The person's hand then reaches for the smartphone, picks it up, and interacts with it, possibly swiping or tapping on the screen. The person holds the smartphone in their hand, with the laptop still visible in the background. The video continues with the person holding the smartphone in their left hand, while their right hand is seen typing on the laptop keyboard. The smartphone's screen is off, and the person appears to be interacting with the laptop. The scene remains consistent with the light-colored desk and the open notebook. The video concludes with the person still holding the smartphone in their left hand and continuing to type on the laptop with their right hand.
Columns: Input Image, Phantom-14B, VACE-14B, Vidu 2.0, Ours
The video begins with a close-up view of a smartphone with a pink case resting on an open notebook, while a person's hand is seen typing on a laptop keyboard in the background. The scene is set on a light-colored desk, and the focus is on the smartphone and the person's hand. As the video progresses, the smartphone's screen lights up, displaying a notification with a red badge and a message. The person's hand then reaches for the smartphone, picks it up, and interacts with it, possibly swiping or tapping on the screen. The person holds the smartphone in their hand, with the laptop still visible in the background. The video continues with the person holding the smartphone in their left hand, while their right hand is seen typing on the laptop keyboard. The smartphone's screen is off, and the person appears to be interacting with the laptop. The scene remains consistent with the light-colored desk and the open notebook. The video concludes with the person still holding the smartphone in their left hand and continuing to type on the laptop with their right hand.
Columns: Input Image, Phantom-14B, VACE-14B, Vidu 2.0, Ours
The video showcases a serene and cozy outdoor setting featuring a stone fireplace with a fire burning inside, situated on a wooden deck. Adjacent to the fireplace is a wicker armchair with a white cushion, and in front of the chair, there is a small wooden table. The deck is surrounded by wooden railings, and the background reveals a dense forest with bare trees, indicating a winter season. The ground is lightly covered with snow, enhancing the wintry atmosphere. Throughout the video, the scene remains consistent with no noticeable changes in the environment, objects, or camera movement, maintaining a tranquil and inviting ambiance.