<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="description"
    content="Our model, Puppet-Master, can synthesize nuanced part-level object dynamics. Trained on synthetic data, at inference time it generalizes to real data.">
  <meta name="keywords" content="Puppet-Master, Motion, Diffusion Models">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics</title>

  <meta property="og:image" content="resources/og_image" />
  <meta property="og:title"
    content="Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics" />
  <meta property="og:description"
    content="Our model, Puppet-Master, can synthesize nuanced part-level object dynamics. Trained on synthetic data, at inference time it generalizes to real data." />
  <!-- Twitter automatically scrapes this. Go to https://cards-dev.twitter.com/validator?
      if you update and want to force Twitter to re-scrape. -->
  <meta property="twitter:card" content="summary" />
  <meta property="twitter:title"
    content="Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics" />
  <meta property="twitter:description"
    content="Our model, Puppet-Master, can synthesize nuanced part-level object dynamics. Trained on synthetic data, at inference time it generalizes to real data." />
  <!-- <meta property="twitter:image" content="resources/og_image" /> -->

  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=G-VFNFH9CKNX"></script>
  <script>
    window.dataLayer = window.dataLayer || [];

    function gtag() {
      dataLayer.push(arguments);
    }

    gtag('js', new Date());

    gtag('config', 'G-VFNFH9CKNX');
  </script>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">

  <script type="module" src="https://unpkg.com/@google/model-viewer/dist/model-viewer.min.js"></script>
  <script type="text/javascript" src="https://code.jquery.com/jquery-1.11.0.min.js"></script>
  <script type="text/javascript" src="https://code.jquery.com/jquery-migrate-1.2.1.min.js"></script>
  <script src="https://unpkg.com/interactjs/dist/interact.min.js"></script>

  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>

<body>

  <section class="hero">
    <div class="hero-body">
      <div class="container is-max-desktop">
        <div class="columns is-centered">
          <div class="column has-text-centered">
            <h1 class="title is-3 publication-title" style="font-size: 2.2rem;">
              Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics</h1>
            <div class="is-size-5 publication-authors">
              <span class="author-block">
                <a href="index.html">Anonymous Authors</a></span>
            </div>

          </div>
        </div>
      </div>
    </div>
  </section>

  <section class="section">
    <div class="container is-max-desktop">
      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <h2 class="title is-3">TL;DR</h2>
        </div>
      </div>

      <div class="content has-text-justified">
        <p>
          <b>Puppet-Master</b> generates realistic <i>part-level</i> motion in the form of a video, conditioned on an
          input image and a few drag interactions, serving as a motion prior for part-level dynamics.
        </p>
      </div>
    </div>
  </section>

  <section class="section">
    <div class="container is-max-desktop">
      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <h2 class="title is-3">Examples</h2>
        </div>
      </div>

      <div class="content has-text-centered is-centered">
        <p>
          All our examples are generated with a <b>single</b> model!
        </p>
      </div>

      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <h3 class="title is-4">Man-Made Objects</h3>
          <div class="content has-text-centered">
            <img src='resources/manmade.gif' width="1200">
          </div>
          <br>
          <h3 class="title is-4">Animals</h3>
          <div class="content has-text-centered">
            <img src='resources/animal.gif' width="1200">
          </div>
          <br>
          <h3 class="title is-4">Humans</h3>
          <div class="content has-text-centered">
            <img src='resources/human.gif' width="1200">
          </div>
          <br>
        </div>
      </div>
    </div>
  </section>

  <section class="section">
    <div class="container is-max-desktop">
      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <h2 class="title is-3">Abstract</h2>
          <div class="content has-text-justified">
            <p>
              We present <b>Puppet-Master</b>, an interactive video generative model that can serve as a motion prior
              for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories (i.e.,
              drags), Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given
              drag interactions. This is achieved by fine-tuning a large-scale pre-trained video diffusion model, for
              which we propose a new conditioning architecture to effectively inject the dragging control. More
              importantly, we introduce the all-to-first attention mechanism, a drop-in replacement for the widely
              adopted spatial attention modules, which significantly improves generation quality by addressing the
              appearance and background issues in existing models. Unlike other motion-conditioned video generators that
              are trained on in-the-wild videos and mostly move an entire object, Puppet-Master is learned from
              Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We propose a strategy to
              automatically filter out sub-optimal animations and augment the synthetic renderings with meaningful
              motion trajectories. Puppet-Master generalizes well to real images across various categories, and
              outperforms existing methods in a zero-shot manner on a real-world benchmark.
            </p>
          </div>
        </div>
      </div>
      <br>
    </div>
  </section>

  <section class="section">
    <div class="container is-max-desktop">
      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <h2 class="title is-3">Technical Details</h2>
        </div>
      </div>

      <!-- Architecture -->
      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <h3 class="title is-4">Architecture</h3>
          <div class="content has-text-centered">
            <img src='resources/architecture.png' width="1000" alt="Architecture figure">
          </div>
        </div>
      </div>
      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <div class="content has-text-justified">
            <p>
              <b>Puppet-Master</b> is built on Stable Video Diffusion (SVD). To enable precise drag conditioning, we
              first modify the original latent video diffusion architecture by <b>(A)</b> adding adaptive layer
              normalization modules to modulate the internal diffusion features and <b>(B)</b> adding cross attention
              with <i>drag tokens</i>. Furthermore, to ensure high-quality appearance and background, we introduce
              <b>(C)</b> <i>all-to-first</i> spatial attention, a drop-in replacement for the spatial self-attention
              modules, in which every noised video frame attends to the first frame.
            </p>
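            <p>
              As a rough illustration (not the released implementation), all-to-first attention can be sketched in
              NumPy as follows. The sketch assumes that queries come from every frame while keys and values come from
              the first frame only. The function name, the tensor shapes, and the omission of learned projections and
              multi-head splitting are simplifications made for this example.
            </p>
<pre><code class="language-python">import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def all_to_first_attention(q, k, v):
    """Spatial attention where every frame reads from the first frame.

    q, k, v: arrays of shape (frames, tokens, channels).
    Returns an array of shape (frames, tokens, channels).
    """
    channels = q.shape[-1]
    # Keep keys and values of frame 0 only (assumed reading of "all-to-first").
    k_first = k[0]  # (tokens, channels)
    v_first = v[0]  # (tokens, channels)
    # Scaled dot-product scores: (frames, tokens, tokens of frame 0).
    scores = q @ k_first.T / np.sqrt(channels)
    return softmax(scores, axis=-1) @ v_first


# Toy inputs standing in for projected features of a 4-frame clip.
rng = np.random.default_rng(0)
frames, tokens, channels = 4, 6, 8
q = rng.normal(size=(frames, tokens, channels))
k = rng.normal(size=(frames, tokens, channels))
v = rng.normal(size=(frames, tokens, channels))
out = all_to_first_attention(q, k, v)
print(out.shape)
</code></pre>
            <p>
              In this sketch the keys and values of later frames are never read, so changing them leaves the output
              unchanged.
            </p>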
          </div>
        </div>
      </div>

      <!-- Data -->
      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <h3 class="title is-4">Data</h3>
          <div class="content has-text-centered">
            <img src='resources/data.png' width="1000" alt="Data figure">
          </div>
        </div>
      </div>
      <div class="columns is-centered has-text-centered">
        <div class="column is-full-width">
          <div class="content has-text-justified">
            <p>
              <b>Puppet-Master</b> is trained on a combined dataset of Drag-a-Move and <b>Objaverse-Animation-HQ</b>. We
              curate high-quality animated 3D models from Objaverse in two steps. First, we extract important features
              for each animation and fit a random forest classifier to decide whether the animation should be included
              in the training set, yielding a dataset of 16K animated models, which we dub <b>Objaverse-Animation</b>.
              Second, we prompt GPT-4V to filter out the remaining sub-optimal animations, yielding a higher-quality
              dataset, <b>Objaverse-Animation-HQ</b>, consisting of 10K animated models.
              We augment the animated models with meaningful sparse motion trajectories to train our drag-conditioned
              video generator.
              We empirically show that training on Objaverse-Animation-HQ leads to a much better model than training on
              Objaverse-Animation, justifying the extra effort in data curation.
            </p>
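            <p>
              The first filtering step could look roughly like the scikit-learn sketch below. It is an illustration
              only: the two features, the toy labels, and all variable names are hypothetical and are not the feature
              set or labels used to build Objaverse-Animation.
            </p>
<pre><code class="language-python">import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical per-animation features (not the authors' feature set):
# column 0 = fraction of vertices that move, column 1 = peak displacement.
labeled_features = rng.uniform(0.0, 1.0, size=(200, 2))
# Toy labels for the sketch: keep animations whose two features sum above 1.
labeled_keep = np.greater(labeled_features.sum(axis=1), 1.0)

# Fit a random forest on the small labeled subset.
classifier = RandomForestClassifier(n_estimators=50, random_state=0)
classifier.fit(labeled_features, labeled_keep)

# Apply the fitted classifier to unlabeled candidates and keep the accepted ones.
candidate_features = rng.uniform(0.0, 1.0, size=(1000, 2))
keep_mask = classifier.predict(candidate_features)
kept_indices = np.flatnonzero(keep_mask)
print(len(kept_indices), "of", len(candidate_features), "candidates kept")
</code></pre>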
          </div>
        </div>
      </div>
    </div>
  </section>

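  <script>
    // Template helpers may be defined in an external script; call them only if present.
    if (typeof initComparisons === 'function') initComparisons();
    if (typeof resetComparisons === 'function') window.addEventListener('resize', resetComparisons);
    if (typeof linkVideos === 'function') {
      linkVideos(0);
      linkVideos(1);
    }
  </script>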
  <script type="text/javascript" src="./static/js/slick.min.js"></script>
  <script src="./static/js/main.js"></script>

</body>

</html>
