<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="LEO: Generative Latent Image Animator for Human Video Synthesis">
  <meta name="keywords" content="LEO">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>LEO: Generative Latent Image Animator for Human Video Synthesis</title>

  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=G-9VZKE74FPW"></script>
  <script>
    window.dataLayer = window.dataLayer || [];

    function gtag() {
      dataLayer.push(arguments);
    }

    gtag('js', new Date());

    gtag('config', 'G-PYVRSFMDRL');
  </script>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>


<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">LEO: Generative Latent Image Animator for Human Video Synthesis</h1>
          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <a href="">ID: 1248</a>
            </span>
          </div>

          
        </div>
      </div>
    </div>
  </div>
</section>


<section class="hero teaser">
  <div class="container is-max-desktop">
    <div class="hero-body">
		<center>
      		<img src="static/images/model.png" title="" style="max-width:80%;vertical-align:top"/>
		</center>
    </div>
  </div>
</section>

<section class="hero teaser">
  <div class="container is-max-desktop">
    <div class="hero-body">
		<center>
      		<video width="580" autoplay controls loop><source src="static/videos/taichi256.mp4" type="video/mp4"></video>
		</center>

      <h2 class="subtitle has-text-centered">
        Video generation &amp; editing with <span class="dnerf">LEO</span>
      </h2>
    </div>
  </div>
</section>


<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>
			Spatio-temporal coherency is a major challenge in synthesizing high-quality videos, particularly human videos that contain rich global and local deformations. To address this challenge, previous approaches have used separate features to represent appearance and motion in the generation process. However, without strict mechanisms to guarantee such disentanglement, separating motion from appearance has remained difficult, resulting in spatial distortions and temporal jittering that break spatio-temporal coherency. Motivated by this, we propose LEO, a novel framework for human video synthesis that places emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps and synthesizes video frames in a warp-and-inpaint manner. The LMDM learns to capture the motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the Taichi-HD, FaceForensics and CelebV-HQ datasets. In addition, the effective disentanglement of appearance and motion in LEO enables two additional tasks: infinite-length human video synthesis and content-preserving video editing.
          </p>
        </div>
      </div>
    </div>

</section>


<section class="section">
  <div class="container is-max-desktop">
    

    <!-- Video Generation -->
    <div class="columns is-centered">
      <div class="column is-full-width">
        <h2 class="title is-3">Video Generation</h2>

        <!-- Unconditional generation. -->
        <h3 class="title is-4">Unconditional generation</h3>
        <div class="content has-text-justified">
          <p>
            Unconditional video generation on the Taichi-HD (128 x 128 and 256 x 256) and FaceForensics (256 x 256) datasets.
          </p>
        </div>
        <div class="content has-text-centered">
	  	<center>
	  		<video width="300" height="300" autoplay controls loop><source src="static/videos/grid-128res-100.mp4" type="video/mp4"></video>
	  		<video width="300" height="300" autoplay controls loop><source src="static/videos/grid256.mp4" type="video/mp4"></video>
	  		<video width="300" height="300" autoplay controls loop><source src="static/videos/grid-ffs256.mp4" type="video/mp4"></video>
	  	</center>
        </div>
        <br/>
        <!--/ Unconditional generation. -->

        <!-- Conditional generation. -->
        <h3 class="title is-4">Conditional generation based on the first frame</h3>
        <div class="content has-text-justified">
          <p>
            Given the first frame, LEO generates the subsequent frames of the video. Results are shown on the Taichi-HD (128 x 128 and 256 x 256), FaceForensics (256 x 256) and CelebV-HQ (256 x 256) datasets.
          </p>
        </div>
        <div class="content has-text-centered">
	  	<center>
	  	<video width="450" height="450" autoplay controls loop><source src="static/videos/grid-128res-100-real.mp4" type="video/mp4"></video>
	  	<video width="450" height="450" autoplay controls loop><source src="static/videos/grid-256-real.mp4" type="video/mp4"></video>
	  	<video width="450" height="450" autoplay controls loop><source src="static/videos/grid-ffs256-real.mp4" type="video/mp4"></video>
	  	<video width="450" height="450" autoplay controls loop><source src="static/videos/grid-celebv-real-2.mp4" type="video/mp4"></video>
	  	</center>
        </div>
        <!--/ Conditional generation. -->
		
		
        <!-- Long video generation. -->
        <h3 class="title is-4">Long video generation</h3>
        <div class="content has-text-justified">
          <p>
            Long videos (1024 frames) generated with LEO.
          </p>
        </div>
        <div class="content has-text-centered">
		<center>
		<video width="450" autoplay controls loop><source src="static/videos/taichi256-long.mp4" type="video/mp4"></video>
		<video width="450" height="450" autoplay controls loop><source src="static/videos/ffs-long.mp4" type="video/mp4"></video>
		</center>
        </div>
        <!--/ Long video generation. -->

        <!-- Appearance and motion disentanglement. -->
        <h3 class="title is-4">Appearance and Motion Disentanglement</h3>
        <div class="content has-text-justified">
          <p>
            LEO is able to disentangle appearance and motion. (Left) Same motion, different appearance. (Right) Same appearance, different motion.
          </p>
        </div>
        <div class="content has-text-centered">
		<center>
		<video width="450" autoplay controls loop><source src="static/videos/grid-same-motion.mp4" type="video/mp4"></video>
		<video width="450" height="450" autoplay controls loop><source src="static/videos/monalisa-long.mp4" type="video/mp4"></video>
		</center>
        </div>
        <!--/ Appearance and motion disentanglement. -->
		
	    <div class="columns is-centered">
	      <!-- Video Editing -->
	      <div class="column">
	        <div class="content">
	          <h2 class="title is-3">Video Editing</h2>
	          <p>
	            By combining LEO with ControlNet, generated videos can be edited by modifying only the first frame.
	          </p>
			<center>
				<video width="450" autoplay controls loop><source src="static/videos/taichi128.mp4" type="video/mp4"></video>
				<video width="450" autoplay controls loop><source src="static/videos/celebv-hq.mp4" type="video/mp4"></video>
			</center>
			<center>
				<video width="450" autoplay controls loop><source src="static/videos/ffs1.mp4" type="video/mp4"></video>
				<video width="450" autoplay controls loop><source src="static/videos/ffs2.mp4" type="video/mp4"></video>
			</center>
	        </div>
	      </div>
	      <!--/ Video Editing -->
	    </div>
		
	    <div class="columns is-centered">
	      <!-- Comparison -->
	      <div class="column">
	        <div class="content">
	          <h2 class="title is-3">Comparison</h2>
	          <p>
	            Comparison with state-of-the-art methods.
	          </p>
			<center>
				<video width="800" autoplay controls loop><source src="static/videos/digan-grid.mp4" type="video/mp4"></video>
				<p>DIGAN (Taichi)</p>
				<video width="800" autoplay controls loop><source src="static/videos/tats-grid.mp4" type="video/mp4"></video>
				<p>TATS (Taichi)</p>
				<video width="800" autoplay controls loop><source src="static/videos/lvdm-grid.mp4" type="video/mp4"></video>
				<p style="color:red;">LVDM (Taichi)</p>
				<video width="800" autoplay controls loop><source src="static/videos/videofusion-grid.mp4" type="video/mp4"></video>
				<p style="color:red;">VideoFusion (T2V, prompt: a man playing Taichi)</p>
				<video width="800" autoplay controls loop><source src="static/videos/taichi_compare.mp4" type="video/mp4"></video>
				<p>Ours (Taichi)</p>
			</center>
			<center>
				<video width="800" autoplay controls loop><source src="static/videos/styleganv-grid.mp4" type="video/mp4"></video>
				<p>StyleGAN-V (FaceForensics)</p>
				<video width="800" autoplay controls loop><source src="static/videos/mostgan-grid.mp4" type="video/mp4"></video>
				<p style="color:red;">MoStGAN-V (FaceForensics)</p>
				<video width="800" autoplay controls loop><source src="static/videos/ffs_compare.mp4" type="video/mp4"></video>
				<p>Ours (FaceForensics)</p>
			</center>
			<center>
				<video width="800" autoplay controls loop><source src="static/videos/tats-long-grid.mp4" type="video/mp4"></video>
				<p>TATS-long (Taichi long video)</p>
				<video width="800" autoplay controls loop><source src="static/videos/taichi_long_compare.mp4" type="video/mp4"></video>
				<p>Ours (Taichi long video)</p>
			</center>
			
          <p style="color:red;">
            Comparison with ControlVideo and Text2Video-zero
          </p>
		  <center>
		<video width="800" autoplay controls loop><source src="static/videos/ffs1-more.mp4" type="video/mp4"></video>
		<p style="color:red;">From left to right: Real, ours, ControlVideo (edge), Text2Video-zero (edge)</p>
		<video width="800" autoplay controls loop><source src="static/videos/ffs2-more.mp4" type="video/mp4"></video>
		<p style="color:red;">From left to right: Real, ours, ControlVideo (edge), Text2Video-zero (edge)</p>
		<video width="800" autoplay controls loop><source src="static/videos/taichi-more.mp4" type="video/mp4"></video>
		<p style="color:red;">From left to right: Real, ours, ControlVideo (skeleton), Text2Video-zero (skeleton)</p>
	</center>
        <p style="color:red;">
          Comparison with StyleTalk
        </p>
		<center>
		<video width="400" autoplay controls loop><source src="static/videos/obama1.mp4" type="video/mp4"></video>
		<p style="color:red;">Left: Ours, Right: StyleTalk (3DMM)</p>
		</center>
		<center>
		<video width="400" autoplay controls loop><source src="static/videos/obama2.mp4" type="video/mp4"></video>
		<p style="color:red;">Left: Ours, Right: StyleTalk (3DMM)</p>
		</center>
		
	        </div>
	      </div>
	      <!--/ Comparison -->
	    </div>
      </div>
    </div>
	
    <div class="columns is-centered">
      <!-- Failure cases -->
      <div class="column">
        <div class="content">
          <h2 style="color:red;" class="title is-3">Failure cases</h2>
          <p style="color:red;">
            We showcase two failure cases of our approach.
          </p>
		<center>
			<video width="450" autoplay controls loop><source src="static/videos/failure2.mp4" type="video/mp4"></video>
		</center>
        </div>
      </div>
      <!--/ Failure cases -->
    </div>
	
    <div class="columns is-centered">
      <!-- Comparison with MRAA and FOMM -->
      <div class="column">
        <div class="content">
          <h2 style="color:red;" class="title is-3">Comparison with MRAA and FOMM</h2>
          <p style="color:red;">
            We compare with FOMM (Siarohin et al., 2019) and MRAA (Siarohin et al., 2021).
          </p>
		<center>
			<video width="900" autoplay controls loop><source src="static/videos/compare_face.mp4" type="video/mp4"></video>
		</center>
        </div>
      </div>
      <!--/ Comparison with MRAA and FOMM -->
    </div>
	
  </div>
</section>


<footer class="footer">
  <div class="container">
    <div class="columns is-centered">
      <div class="column is-8">
        <div class="content">
          <p>
            This website is licensed under a <a rel="license"
                                                href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
            Commons Attribution-ShareAlike 4.0 International License</a>.
          </p>
          <p>
            Website adapted from the following <a rel="license"
                                                href="https://github.com/nerfies/nerfies.github.io">source code</a>.
          </p>
        </div>
      </div>
    </div>
  </div>
</footer>

</body>
</html>
