<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width,initial-scale=1" />
  <meta name="theme-color" content="#000000" />
  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="Open-Sans.css">
  <link rel="stylesheet" href="index.css">
  <script defer="defer" src="./static/js/main.cb41f6a5.js"></script>
  <link href="./static/css/main.4017e162.css" rel="stylesheet">
  <meta name="description"
        content="InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions">
  <title>InterActHuman Project</title>
</head>

<body>
  <div id="root" class="column-flex">
    <div id="title-flex" class="column-flex">
      <h1>  InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions </h1>
      <span>
        Anonymous Authors
        <br />
      </span>
      <small><span><b>TL;DR</b>: InterActHuman is a diffusion transformer (DiT)-based framework for multi-concept, audio-driven human video generation. It overcomes the usual single-entity limitation by localizing and aligning multi-modal inputs for each distinct subject. Instead of fusing all conditions globally, the method uses an iterative, in-network mask predictor to infer a fine-grained spatio-temporal layout for each identity, so that local cues, such as audio for accurate lip synchronization, are injected only into the corresponding regions during video synthesis. Because the DiT backbone runs a multi-step denoising process, the masks are refined dynamically: the prediction from one step guides local condition injection in the next, ensuring that each audio signal is associated with the correct individual (represented by a reference image). Extensive evaluations demonstrate state-of-the-art performance in lip-sync accuracy, subject consistency, and overall video quality, and a scalable data pipeline with over 2.6 million annotated video-entity pairs supports training. Applications include audio-driven multi-person video generation and multi-concept video customization such as human-object interaction.</span></small>
      <p class="styled-text">
        <b>*</b> Note that to generate all results on this page, <strong>only text prompts and N paired {reference image, audio segment} inputs are required</strong>. Lip-sync is natively supported by the DiT layers via audio cross-attention, so no post-processing is needed. In our paper, a concept is an appearance represented by a reference image, which could be a human, an animal, a background, or an object. An identity is a specific person to whom an audio condition is applied; identities are a subset of concepts.
      </p>

      <div class='responsive-image-container'>
        <img src='image/framework.jpg' alt='InterActHuman framework overview' />
      </div>
    </div>

    <div id="sections" class="column-flex">
      <h3>Dialogue Videos</h3>
        <p>
          InterActHuman supports multi-person dialogue video generation, where each person is represented by a reference image. Each audio segment can be assigned to a specific person as the user chooses, and multiple audio segments can be assigned to the same person in different frames of a video. For example, with five audio segments of different durations and two speakers, the audio assignment could be 1-2-1-2-1, where 1 denotes speaker 1 and 2 denotes speaker 2. In some cases, the appearance or audio is cropped or trimmed from publicly available videos generated by Veo3. <b>Please refer to the 'input_imgs' folder for reference images. All reference images used in our demo are person head images.</b>
        </p>
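        <p>
          As an illustration of the 1-2-1-2-1 assignment above, the following is a minimal sketch of how segments could be laid end to end and grouped per speaker. It is not the project's actual input format: the function name <code>buildSpeakerTimeline</code>, the field names, and the durations (in seconds) are hypothetical and chosen only for this example.
        </p>
<pre>
```javascript
// Hypothetical input: five audio segments (durations in seconds are made up)
// and the 1-2-1-2-1 speaker assignment from the example above.
const durations = [2.0, 1.5, 3.0, 1.0, 2.5];
const assignment = [1, 2, 1, 2, 1];

// Lay the segments end to end and collect, per speaker, the time intervals
// during which that speaker's audio segment is playing.
function buildSpeakerTimeline(durations, assignment) {
  if (durations.length !== assignment.length) {
    throw new Error('Each audio segment needs exactly one speaker');
  }
  const timeline = new Map();
  let start = 0;
  durations.forEach((duration, i) => {
    const speaker = assignment[i];
    const end = start + duration;
    if (!timeline.has(speaker)) timeline.set(speaker, []);
    timeline.get(speaker).push({ segment: i + 1, start, end });
    start = end;
  });
  return timeline;
}

const timeline = buildSpeakerTimeline(durations, assignment);
```
</pre>
        <p>
          In this sketch, speaker 1 ends up with three intervals and speaker 2 with two, so one reference image can be paired with several non-contiguous audio segments.
        </p>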
        <div class="video-slider">
          <video src="video/dialogue/1_watermarked_id.mp4"></video>
          <video src="video/dialogue/2_watermarked_id.mp4"></video>
          <video src="video/dialogue/10_watermarked_id.mp4"></video>
          <video src="video/dialogue/4_watermarked_id.mp4"></video>
          <video src="video/dialogue/5_watermarked_id.mp4"></video>
          <video src="video/dialogue/6_watermarked_id.mp4"></video>
          <video src="video/dialogue/7_watermarked_id.mp4"></video>
          <video src="video/dialogue/8_watermarked_id.mp4"></video>
          <video src="video/dialogue/9_watermarked_id.mp4"></video>
        </div>
        <h3>Human-Object Interaction Videos with Audio</h3>
        <p>
          InterActHuman supports single-person talking video generation with human-object interaction (HOI), where the person and each object are represented by reference images. Notably, object affordance is learned implicitly from data and controlled by the user's text prompt. In our cases the audio condition is applied only to the person, although audio could also be applied to human-like objects such as a cookie or a cup. HOI can also be applied to multi-person talking video generation; we do not show such results here because preparing the input conditions is complex. Readers can assume that the object is the one held by the person. In the first case, the object is the car chassis.
        </p>

        <div class="video-slider">
          <video src="video/hoi/9_watermarked_id.mp4"></video>
          <video src="video/better_hoi/9_id.mp4"></video>
          <video src="video/hoi/10_watermarked_id.mp4"></video>
          <video src="video/better_hoi/1_id.mp4"></video>
          <video src="video/hoi/11_watermarked_id.mp4"></video>
          <video src="video/better_hoi/2_id.mp4"></video>
          <video src="video/hoi/12_watermarked_id.mp4"></video>
          <video src="video/better_hoi/3_id.mp4"></video>
          <video src="video/better_hoi/4_id.mp4"></video>
          <video src="video/better_hoi/5_id.mp4"></video>
          <video src="video/better_hoi/6_id.mp4"></video>
          <video src="video/better_hoi/7_id.mp4"></video>
          <video src="video/better_hoi/8_id.mp4"></video>
          <video src="video/better_hoi/10_id.mp4"></video>
          <video src="video/better_hoi/11_id.mp4"></video>
        </div>

        <h3>Domain Diversity</h3>
        <p>In terms of input domain diversity, InterActHuman supports cartoons, artificial objects, and animals. The videos provided here are generated without an audio condition and can be regarded as showcases of the multi-concept video generation part of InterActHuman.</p>
        <div class="video-slider">
          <video src="video/diversity/5_watermarked_id.mp4"></video>
          <video src="video/diversity/6_watermarked_id.mp4"></video>
          <video src="video/diversity/7_watermarked_id.mp4"></video>
          <video src="video/diversity/8_watermarked.mp4"></video>
          <video src="video/diversity/10_watermarked.mp4"></video>
          <video src="video/diversity/11_watermarked.mp4"></video>
        </div>
      <br/>
      <br/>
    </div>
  </div>
  <script src="index.js"></script>
  <script>
    function comming_soon_click() {
      alert('Coming soon!');
    }
    function TBD_click() {
      alert('TBD');
    }
  </script>
</body>



</html>
