<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="LAVITA: Latent Video Diffusion Models with Spatio-temporal Transformers">
  <meta name="keywords" content="LAVITA, video generation, latent diffusion models, Transformers">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>LAVITA: Latent Video Diffusion Models with Spatio-temporal Transformers</title>

  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=G-9VZKE74FPW"></script>
  <script>
    window.dataLayer = window.dataLayer || [];

    function gtag() {
      dataLayer.push(arguments);
    }

    gtag('js', new Date());

    gtag('config', 'G-PYVRSFMDRL');
  </script>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>


<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">LAVITA: Latent Video Diffusion Models with Spatio-temporal Transformers</h1>
          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <a href="">ID: 2376</a>
            </span>
          </div>

          
        </div>
      </div>
    </div>
  </div>
</section>


<section class="hero teaser">
  <div class="container is-max-desktop">
    <div class="hero-body">
		<center>
      		<img src="static/images/architecture.svg" alt="LAVITA architecture" title="" style="max-width:80%;vertical-align:top"/>
		</center>
    </div>
  </div>
</section>


<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>
            Video generation is challenging because it requires effective modeling of rich spatio-temporal information in high-dimensional video data. To tackle this challenge, we propose the LAtent VIdeo diffusion model with spatio-temporal TrAnsformers (LAVITA), a novel architecture that integrates the Transformer architecture into diffusion models for the first time within the realm of video generation. Conceptually, LAVITA models spatial and temporal information separately, both to accommodate their inherent disparities and to reduce computational complexity. Following this design strategy, we design several Transformer-based model variants that integrate spatial and temporal information harmoniously. Moreover, we identify best practices in architectural choices and learning strategies for LAVITA through rigorous empirical analysis. Our comprehensive evaluation demonstrates that LAVITA achieves state-of-the-art performance on several standard video generation benchmarks, including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, outperforming the current best models. We believe that LAVITA provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.
          </p>
        </div>
      </div>
    </div>

  </div>
</section>


<section class="section">
  <div class="container is-max-desktop">
    

    <!-- Video Generation -->
    <div class="columns is-centered">
      <div class="column is-full-width">
        <!-- <h2 class="title is-3">Video Generation</h2> -->

        <h3 class="title is-4">Unconditional generation</h3>
        <div class="content has-text-justified">
          <p>
            Unconditional video generation on the Taichi-HD, FaceForensics, and SkyTimelapse datasets, all at 256 x 256 resolution.
          </p>
        </div>
        <div class="content has-text-centered">
	  	<center>
	  		<video width="500" height="500" autoplay controls loop><source src="static/videos/taichi-grid.mp4" type="video/mp4"></video>
	  		<video width="500" height="500" autoplay controls loop><source src="static/videos/ffs-grid.mp4" type="video/mp4"></video>
	  		<video width="500" height="500" autoplay controls loop><source src="static/videos/sky-grid.mp4" type="video/mp4"></video>
	  	</center>
        </div>
        <br/>
        <h3 class="title is-4">Conditional generation based on classes</h3>
        <div class="content has-text-justified">
          <p>
            Given a class label, LAVITA generates videos of the desired class. Results are shown on the UCF101 dataset (256 x 256).
          </p>
        </div>
        <div class="content has-text-centered">
	  	<center>
	  	<video width="450" height="450" autoplay controls loop><source src="static/videos/ucf-grid.mp4" type="video/mp4"></video>
	  	</center>
        </div>
        <h3 class="title is-4">Conditional generation based on prompts</h3>
        <div class="content has-text-justified">
          <p>
            Given a prompt, LAVITA generates the desired videos. Results are shown on the Webv2m dataset and a subset of Laion5B (comprising approximately 6,400,000 images).
          </p>
        </div>
        <div class="content has-text-centered">
		<center>
		<video width="500" height="500" autoplay controls loop><source src="static/videos/webv2m-004-grid.mp4" type="video/mp4"></video>
          <p>
            Decorate with pineapple sweet cake roll.
          </p>
		<video width="500" height="500" autoplay controls loop><source src="static/videos/webv2m-028-grid.mp4" type="video/mp4"></video>
          <p>
            Reeds in the wind, razim lake, romania.
          </p>
		<video width="500" height="500" autoplay controls loop><source src="static/videos/webv2m-100-grid.mp4" type="video/mp4"></video>
          <p>
            Slow pan upward of blazing oak fire in an indoor fireplace.
          </p>
		<video width="500" height="500" autoplay controls loop><source src="static/videos/webv2m-154-grid.mp4" type="video/mp4"></video>
          <p>
            Flight over the country.
          </p>
		<video width="500" height="500" autoplay controls loop><source src="static/videos/webv2m-158-grid.mp4" type="video/mp4"></video>
          <p>
            Sunset over the sea.
          </p>
		</center>
        </div>

        <h3 class="title is-4">Comparison with state-of-the-art methods</h3>
        <div class="content has-text-justified">
          <p>
            Visual comparison with other state-of-the-art methods on the UCF101, Taichi-HD, FaceForensics, and SkyTimelapse datasets.
          </p>
        </div>
        <h3 class="title is-5">UCF101</h3>
        <div class="content has-text-centered">
        <video width="800" height="500" autoplay controls loop><source src="static/videos/ucf-pvdm-grid.mp4" type="video/mp4"></video>
          <p>
            PVDM
          </p>
		<video width="800" height="500" autoplay controls loop><source src="static/videos/ucf-ours-grid.mp4" type="video/mp4"></video>
          <p>
            Ours
          </p>
        </div>

        <h3 class="title is-5">Taichi-HD</h3>
        <div class="content has-text-centered">
        <video width="800" height="500" autoplay controls loop><source src="static/videos/taichi-digan-grid.mp4" type="video/mp4"></video>
          <p>
            DIGAN
          </p>
		<video width="800" height="500" autoplay controls loop><source src="static/videos/taichi-pvdm-grid.mp4" type="video/mp4"></video>
          <p>
            PVDM
          </p>
        <video width="800" height="500" autoplay controls loop><source src="static/videos/taichi-ours-grid.mp4" type="video/mp4"></video>
          <p>
            Ours
          </p>
        </div>

        <h3 class="title is-5">FaceForensics</h3>
        <div class="content has-text-centered">
            <video width="800" height="500" autoplay controls loop><source src="static/videos/ffs-stylegan-v-grid.mp4" type="video/mp4"></video>
              <p>
                StyleGAN-V
              </p>
            <video width="800" height="500" autoplay controls loop><source src="static/videos/ffs-pvdm-grid.mp4" type="video/mp4"></video>
              <p>
                PVDM
              </p>
            <video width="800" height="500" autoplay controls loop><source src="static/videos/ffs-ours-grid.mp4" type="video/mp4"></video>
              <p>
                Ours
              </p>
            </div>

        <h3 class="title is-5">SkyTimelapse</h3>
        <div class="content has-text-centered">
            <video width="800" height="500" autoplay controls loop><source src="static/videos/sky-stylegan-v-grid.mp4" type="video/mp4"></video>
              <p>
                StyleGAN-V
              </p>
            <video width="800" height="500" autoplay controls loop><source src="static/videos/sky-pvdm-grid.mp4" type="video/mp4"></video>
              <p>
                PVDM
              </p>
            <video width="800" height="500" autoplay controls loop><source src="static/videos/sky-ours-grid.mp4" type="video/mp4"></video>
              <p>
                Ours
              </p>
            </div>
        
      </div>
    </div>
  </div>
</section>


<footer class="footer">
  <div class="container">
    <div class="columns is-centered">
      <div class="column is-8">
        <div class="content">
          <p>
            This website is licensed under a <a rel="license"
                                                href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
            Commons Attribution-ShareAlike 4.0 International License</a>.
          </p>
          <p>
            Website adapted from the following <a
                                                href="https://github.com/nerfies/nerfies.github.io">source code</a>.
          </p>
        </div>
      </div>
    </div>
  </div>
</footer>

</body>
</html>
