<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>ID-Composer</title>
  <link rel="stylesheet" href="./static/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet" href="./static/css/index.css">  
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
  <style>
    .morphing-text {
      background-image: linear-gradient(to right,
        #ca12f3 0%, #5f0d80 25%, #2892a0 75%, #50d5e9 100%
      );
      -webkit-background-clip: text;
      background-clip: text;
      color: transparent;
    }
    .center-img {
      display: block;
      margin-left: auto;
      margin-right: auto;
    }
    .result-media {
      height: 160px;
      width: 100%;
      object-fit: contain;
      vertical-align: middle;
    }
    .result-row td {
      padding: 0 5px;
    }
  </style>
</head>

<body>

  <!-- Section 1: Banner -->
  <section class="hero">
    <div class="hero-body">
      <div class="container is-max-desktop">
        <div class="columns is-centered">
          <div class="column has-text-centered">

            <!-- Title -->
            <h1 class="title is-2 publication-title">
              <span class="morphing-text model-name">ID-Composer</span>:
              <span>Multi-Subject Video Synthesis with Hierarchical Identity Preservation</span>
            </h1>

            <div class="is-size-5 publication-info">
              <b>ICLR Submission #3057</b>
            </div>

          </div>
        </div>
      </div>
    </div>
  </section>

  <!-- Section 2: Teaser -->
  <section class="hero teaser">
    <div class="container is-max-desktop">
      <div class="hero-body">
        <div style="text-align: center;">
          <img src="./static/images/teaser.png" width="100%" class="center-img" alt="ID-Composer teaser figure"/>
        </div>
        <h2 class="subtitle has-text-centered">
          <span class="morphing-text model-name"><b>ID-Composer</b></span> <b>is a multi-subject video synthesis model that</b>
          <br>
          <b>generates subject-consistent videos from multiple reference images.</b>
        </h2>

      </div>
    </div>
  </section>

  <!-- Section 3: Abstract -->
  <section style="margin-top: -5pt;" class="section">
    <div class="container is-max-desktop">

      <!-- Abstract -->
      <div class="columns is-centered has-text-centered">
        <div class="column is-four-fifths">
          <h2 class="title is-3">🧩&nbsp;&nbsp; Abstract &nbsp;&nbsp;🧩</h2>
          <div class="content has-text-justified">
            <p>
            Video generative models pretrained on large-scale datasets can produce high-quality videos, but they are often conditioned on text or a single image, which limits their controllability and applicability.
            We introduce <span class="morphing-text model-name"><b>ID-Composer</b></span>, <b>a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images</b>. This task is challenging because it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency.
            </p>
            <p>
            <span class="morphing-text model-name"><b>ID-Composer</b></span> has two key designs: (1) <b>A hierarchical identity-preserving attention mechanism</b>, which aggregates features hierarchically within and across subjects and modalities, enabling identity consistency and textual faithfulness; (2) <b>Semantic understanding via a pretrained vision-language model (VLM)</b>, which leverages the VLM's semantic understanding to provide fine-grained guidance and capture complex interactions among multiple subjects.
            <b>An online reinforcement learning phase</b> is further introduced to enhance video quality and identity preservation.
            Extensive experiments demonstrate that <span class="morphing-text model-name"><b>ID-Composer</b></span> surpasses existing methods in identity preservation, temporal consistency, and video quality.
            Code and data will be released.
            </p>
          </div>
        </div>
      </div>

    </div>
  </section>

  <!-- Section 3: Method -->
  <section class="section">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column is-full-width">

          <h2 class="title is-3 has-text-centered">🔮&nbsp;&nbsp; Method &nbsp;&nbsp;🔮</h2>
          <h3 class="title is-4">Architecture</h3>
          <div class="content has-text-justified">
            <img src="./static/images/overview.png" width="100%" class="center-img" alt="Overview of the ID-Composer architecture"/>
            <p>
              <span class="morphing-text model-name"><b>ID-Composer</b></span> generates multi-subject videos from a text prompt and multiple reference images through three components:
              <b>(1) Hierarchical Identity-Preserving Attention</b>, which aggregates features both within and across subjects and modalities, ensuring identity consistency and faithful textual alignment;
              <b>(2) Semantic Guidance via Pretrained Vision-Language Models (VLMs)</b>, which leverages VLMs' rich semantic understanding to capture fine-grained interactions among multiple subjects and modalities; and
              <b>(3) An Online Reinforcement Learning Phase</b>, which further enhances video quality and preserves subject identities over time. Extensive experiments show that <span class="morphing-text model-name"><b>ID-Composer</b></span> outperforms previous methods in identity preservation, temporal consistency, and overall video quality.
            </p>
          </div>

          <h3 class="title is-4">Dataset Curation</h3>
          <div class="content has-text-justified">
            <img src="./static/images/dataset.png" width="100%" class="center-img" alt="Statistics of the constructed dataset"/>
            <p style="margin-top: 5pt;">
              <b>Statistics of the constructed dataset. The dataset is organized into four primary scenarios: Human, Objects, Environment, and Nature, each containing a variety of subcategories.</b>
            </p>
          </div>

        </div>
      </div>
    </div>
  </section>

  <!-- Section 4: Results -->
  <section class="section" id="results">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column is-full-width">

          <h2 class="title is-3 has-text-centered">🧩&nbsp;&nbsp; Results of <span class="morphing-text model-name"><b>ID-Composer</b></span> &nbsp;&nbsp;🧩</h2>
          <div style="text-align:center">
            <table style="width: 100%; table-layout: fixed;">
              <tbody>

                <tr class="prompt-row" style="text-align: center;">
                  <td>Input Image</td>
                  
                  <td>Phantom-14B</td>
                  <td>VACE-14B</td>
                  <td>Kling 1.6</td>
                  <!-- <td>Vidu 2.0</td> -->
                  <td><span class="morphing-text model-name">Ours</span></td>
                </tr>

                <!-- Example 1 -->
                <tr class="result-row">
                  <td><img src="static/images/videos/input1.png" class="result-media" alt="Input image for example 1"></td>
                  <td><video src="static/images/videos/phantom1.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/vace1.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/keling1.mp4" autoplay loop muted class="result-media"></video></td>
                  <!-- <td><video src="static/images/videos/vidu1.mp4" autoplay loop muted class="result-media"></video></td> -->
                  <td><video src="static/images/videos/our1_480p.mp4" autoplay loop muted class="result-media"></video></td>
                </tr>
                <tr>
                  <td colspan="5" style="text-align: center; padding-top: 10px;">
                    The video features a man with a rugged beard, wearing a leather jacket, riding a vintage motorcycle along a desert highway. His expression is focused, eyes narrowed slightly against the wind, as the setting sun casts a warm glow over the landscape. The highway stretches endlessly, bordered by arid land with occasional cacti and rocky outcrops. The motorcycle roars smoothly, leaving a light trail of dust. In the distance, hazy mountains are silhouetted against the amber sky. The scene suggests adventure and determination, evoking freedom, with the man riding purposefully through the tranquil, sunlit desert.
                  </td>
                </tr>

                <tr class="prompt-row" style="text-align: center;">
                  <td>Input Image</td>
                  <td>Phantom-14B</td>
                  <td>VACE-14B</td>
                  <td>Kling 1.6</td>
                  <!-- <td>Vidu 2.0</td> -->
                  <td><span class="morphing-text model-name">Ours</span></td>
                </tr>

                <tr class="result-row">
                  <td><img src="static/images/videos/input2.png" class="result-media" alt="Input image for example 2"></td>
                  <td><video src="static/images/videos/phantom2.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/vace2.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/keling2.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/our2_480p.mp4" autoplay loop muted class="result-media"></video></td>
                </tr>
                <tr>
                  <td colspan="5" style="text-align: center; padding-top: 10px;">
                    The video begins with a close-up view of a smartphone with a pink case resting on an open notebook, while a person's hand is seen typing on a laptop keyboard in the background. The scene is set on a light-colored desk, and the focus is on the smartphone and the person's hand. As the video progresses, the smartphone's screen lights up, displaying a notification with a red badge and a message. The person's hand then reaches for the smartphone, picks it up, and interacts with it, possibly swiping or tapping on the screen. The person holds the smartphone in their hand, with the laptop still visible in the background. The video continues with the person holding the smartphone in their left hand, while their right hand is seen typing on the laptop keyboard. The smartphone's screen is off, and the person appears to be interacting with the laptop. The scene remains consistent with the light-colored desk and the open notebook. The video concludes with the person still holding the smartphone in their left hand and continuing to type on the laptop with their right hand.
                  </td>
                </tr>

                <tr class="prompt-row" style="text-align: center;">
                  <td>Input Image</td>
                  <td>Phantom-14B</td>
                  <td>VACE-14B</td>
                  <!-- <td>Kling 1.6</td> -->
                  <td>Vidu 2.0</td>
                  <td><span class="morphing-text model-name">Ours</span></td>
                </tr>
                <tr class="result-row">
                  <td><img src="static/images/videos/input3.png" class="result-media" alt="Input image for example 3"></td>
                  <td><video src="static/images/videos/phantom3.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/vace3.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/vidu3.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/our3_480p.mp4" autoplay loop muted class="result-media"></video></td>
                </tr>
                <tr>
                  <td colspan="5" style="text-align: center; padding-top: 10px;">
                    The video begins with a close-up view of a smartphone with a pink case resting on an open notebook, while a person's hand is seen typing on a laptop keyboard in the background. The scene is set on a light-colored desk, and the focus is on the smartphone and the person's hand. As the video progresses, the smartphone's screen lights up, displaying a notification with a red badge and a message. The person's hand then reaches for the smartphone, picks it up, and interacts with it, possibly swiping or tapping on the screen. The person holds the smartphone in their hand, with the laptop still visible in the background. The video continues with the person holding the smartphone in their left hand, while their right hand is seen typing on the laptop keyboard. The smartphone's screen is off, and the person appears to be interacting with the laptop. The scene remains consistent with the light-colored desk and the open notebook. The video concludes with the person still holding the smartphone in their left hand and continuing to type on the laptop with their right hand.
                  </td>
                </tr>

                <tr class="prompt-row" style="text-align: center;">
                  <td>Input Image</td>
                  <td>Phantom-14B</td>
                  <td>VACE-14B</td>
                  <!-- <td>Kling 1.6</td> -->
                  <td>Vidu 2.0</td>
                  <td><span class="morphing-text model-name">Ours</span></td>
                </tr>
                <tr class="result-row">
                  <td><img src="static/images/videos/input4.png" class="result-media" alt="Input image for example 4"></td>
                  <td><video src="static/images/videos/phantom4.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/vace4.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/vidu4.mp4" autoplay loop muted class="result-media"></video></td>
                  <td><video src="static/images/videos/our4.mp4" autoplay loop muted class="result-media"></video></td>
                </tr>
                <tr>
                  <td colspan="5" style="text-align: center; padding-top: 10px;">
                    The video showcases a serene and cozy outdoor setting featuring a stone fireplace with a fire burning inside, situated on a wooden deck. Adjacent to the fireplace is a wicker armchair with a white cushion, and in front of the chair, there is a small wooden table. The deck is surrounded by wooden railings, and the background reveals a dense forest with bare trees, indicating a winter season. The ground is lightly covered with snow, enhancing the wintry atmosphere. Throughout the video, the scene remains consistent with no noticeable changes in the environment, objects, or camera movement, maintaining a tranquil and inviting ambiance.
                  </td>
                </tr>

   

              </tbody>
            </table>
          </div>
    
        </div>
      </div>
    </div>
  </section>

  <!-- Section -1: Footer -->
  <footer class="footer">
    <div class="container">

      <!-- Copyright -->
      <div class="columns is-centered">
        <div class="column is-8">
          <div class="content">
            <p style="text-align: center">
              The website template is borrowed from <a href="https://nerfies.github.io">Nerfies</a>.
              We thank the authors for their work.
            </p>
          </div>
        </div>
      </div>

    </div>
  </footer>

</body>

</html>
