<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models">
  <meta name="keywords" content="LaVie">
<!--  <meta name="viewport" content="width=device-width, initial-scale=2">-->
  <meta name="viewport" content="width=device-width">
  <title>LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models</title>

  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=G-9VZKE74FPW"></script>
  <script>
    window.dataLayer = window.dataLayer || [];

    function gtag() {
      dataLayer.push(arguments);
    }

    gtag('js', new Date());

    gtag('config', 'G-PYVRSFMDRL');
  </script>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
  <style>
    body {
      max-width: 2400px;
      margin: 0 auto;
    }
  </style>
</head>

<body>


<section class="hero">
  <div class="hero-body">
<!--    <div class="container is-max-desktop">-->
      <div class="container is-fullhd">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models</h1>
          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <a href="">Paper ID: 2758</a>
            </span>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>


<section class="hero is-light is-small">
  <div class="hero-body">
	  <div class="columns is-centered has-text-centered">
    <div class="container is-max-width">
		
		<h2 class="title is-3">Text-to-Video Generation</h2>
		<h2 class="title is-5">(Click image to play video)</h2>
		
      <div id="results-carousel" class="carousel results-carousel">

          <!-- 1 column-->
          <div class="column is-multiline">

              <div class="item item-toby">
			  <video width="450" height="300" class="clickplay">
				  <source src="static/videos/12.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Cinematic shot of Van Gogh's selfie, Van Gogh style
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/2.mp4" type="video/mp4">
				  
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A corgi’s head depicted as an explosion of a nebula, high quality
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/7.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A panda drinking coffee in a cafe in Paris
              </div>

          </div>

          <!-- 2 column-->
          <div class="column is-multiline">

               <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/14.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Iron Man flying in the sky
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/5.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A jellyfish floating through the ocean, with bioluminescent tentacles
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/6.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A Mars rover moving on Mars
              </div>
          </div>

          <!-- 3 column-->
          <div class="column is-multiline">

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/15.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  The bund Shanghai, oil painting
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/3.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A fantasy landscape, trending on artstation, 4k, high resolution
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/8.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A space shuttle launching into orbit, with flames and smoke billowing out from the engines
              </div>

          </div>

          <!-- 4 column-->
          <div class="column is-multiline ">
              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/10.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A super cool giant robot in Cyberpunk city, artstation
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/11.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A tropical beach at sunrise, with palm trees and crystal-clear water in the foreground
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/1.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh
              </div>

          </div>

          <!-- 5 column-->
          <div class="column is-multiline">
              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/13.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Gwen Stacy reading a book
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/4.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A future where humans have achieved teleportation technology
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/9.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A steam train moving on a mountainside by Vincent van Gogh
              </div>

          </div>

          <!-- 6 column-->
          <div class="column is-multiline">
              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/16.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Yoda playing guitar on the stage
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/17.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A beautiful coastal beach in spring, waves lapping on sand by Hokusai, in the style of Ukiyo
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/27.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A happy fuzzy panda playing guitar nearby a campfire, snow mountain in the background
              </div>
          </div>

          <!-- 7 column-->
          <div class="column is-multiline">

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/21-2.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A cat eating food out of a bowl, in style of Van Gogh
              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/19.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh

              </div>

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos/46.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Vincent van Gogh is painting in the room
              </div>

          </div>

      </div>
    </div>
  </div>
</div>
</section>


<section class="hero is-light is-small">
  <div class="hero-body">
	  <div class="columns is-centered has-text-centered">
    <div class="container is-max-maxwidth">
		
		<h2 class="title is-3">Long Video Generation</h2>
		<h2 class="title is-5">(Click image to play video)</h2>
		
      <div id="results-carousel" class="carousel results-carousel">

          <!-- 1 column-->
          <div class="column is-multiline">

              <div class="item item-toby">
			  <video width="450" height="300" class="clickplay">
				  <source src="static/videos_long/panda.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A panda playing guitar near a campfire, snow mountain in the background.
              </div>

          </div>

          <!-- 2 column-->
          <div class="column is-multiline">

               <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos_long/car.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  A car moving on an empty street, rainy evening, Van Gogh painting.
              </div>

          </div>

          <!-- 3 column-->
          <div class="column is-multiline">

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos_long/einstain.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Albert Einstein is reading a paper.
              </div>
          </div>

          <!-- 4 column-->
          <div class="column is-multiline">
              <div class="item item-toby">
              </div>
          </div>

      </div>
    </div>
  </div>
</div>
</section>


<section class="hero is-light is-small">
  <div class="hero-body">
	  <div class="columns is-centered has-text-centered">
    <div class="container is-max-maxwidth">
		
		<h2 class="title is-3">Personalized T2V Generation</h2>
		<h2 class="title is-5">(Click image to play video)</h2>
		
	  <img width="300" height="300" src="static/videos_personalized/1.jpg" alt="Training image sample 1">
	  <img width="300" height="300" src="static/videos_personalized/2.jpg" alt="Training image sample 2">
	  <img width="300" height="300" src="static/videos_personalized/3.jpg" alt="Training image sample 3">
	  <p>Training image samples</p>  
		
      <div id="results-carousel" class="carousel results-carousel">

          <!-- 1 column-->
          <div class="column is-multiline">

              <div class="item item-toby">
			  <video width="450" height="300" class="clickplay">
				  <source src="static/videos_personalized/2.mp4" type="video/mp4">
			  </video> 
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Misaka Mikoto walking in the city.
              </div>

          </div>

          <!-- 2 column-->
          <div class="column is-multiline">

               <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos_personalized/4.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Misaka Mikoto walking in the space.
              </div>

          </div>

          <!-- 3 column-->
          <div class="column is-multiline">

              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos_personalized/3.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Misaka Mikoto walking in the space.
              </div>


          </div>

          <!-- 4 column-->
          <div class="column is-multiline">
              <div class="item item-toby">
              <video width="450" height="300" class="clickplay">
				  <source src="static/videos_personalized/1.mp4" type="video/mp4">
			  </video>
              </div>
              <div class="content" style="font-family: Arial; font-style: italic">
                  Misaka Mikoto.
              </div>
          </div>

      </div>
    </div>
  </div>
</div>
</section>


<section class="hero is-light is-small">
  <div class="hero-body">
	  <div class="columns is-centered has-text-centered">
    <div class="container is-max-maxwidth">
		
		<h2 class="title is-3">Vimeo25M Samples</h2>
		<h2 class="title is-5">(Click image to play video)</h2>
		
    <div id="results-carousel" class="carousel results-carousel">

      <!-- 1 column-->
      <div class="column is-multiline">

          <div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_25m/a bride and groom walk down the aisle of a church with people in.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A bride and groom walk down the aisle of a church with people in.
          </div>

      </div>

      <!-- 2 column-->
      <div class="column is-multiline">

           <div class="item item-toby">
          <video width="568" height="320" class="clickplay">
      <source src="static/videos_25m/a man standing at a podium giving a speech.mp4" type="video/mp4">
    </video>
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A man standing at a podium giving a speech.
          </div>

      </div>

      <!-- 3 column-->
      <div class="column is-multiline">

          <div class="item item-toby">
          <video width="568" height="320" class="clickplay">
      <source src="static/videos_25m/a sunset with clouds in the sky.mp4" type="video/mp4">

    
    </video>
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A sunset with clouds in the sky.
          </div>


      </div>

      <!-- 4 column-->
      <div class="column is-multiline">
          <div class="item item-toby">
          <video width="752" height="320" class="clickplay">
      		<source src="static/videos_25m/an aerial view of a large estate.mp4" type="video/mp4">
    		</video>
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            An aerial view of a large estate.
          </div>
      </div>

      </div>
    </div>
  </div>
</div>
</section>


<section class="hero is-light is-small">
  <div class="hero-body">
	  <div class="columns is-centered has-text-centered">
    <div class="container is-max-maxwidth">
		
		<h2 class="title is-3" style="color: red">Comparison (Joint image-video fine-tuning)</h2>
		<h2 class="title is-5">(Click image to play video)</h2>
		
    <div id="results-carousel" class="carousel results-carousel">

      <!-- 1 column-->
      <div class="column is-multiline">

          <div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_compare_joint/ironman_joint.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            Iron Man dancing on the beach (joint image-video fine-tuning)
          </div>

<div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_compare_joint/teddy_joint.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A teddy bear playing in the water (joint image-video fine-tuning)
          </div>

      </div>

      <!-- 2 column-->
      <div class="column is-multiline">

           <div class="item item-toby">
          <video width="568" height="320" class="clickplay">
      <source src="static/videos_compare_joint/ironman_failure_1.mp4" type="video/mp4">
    </video>
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            Iron Man dancing on the beach (only video fine-tuning)
          </div>

<div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_compare_joint/teddy_failure_1.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A teddy bear playing in the water (only video fine-tuning)
          </div>

      </div>

      <!-- 3 column-->
      <div class="column is-multiline">

          <div class="item item-toby">
          <video width="568" height="320" class="clickplay">
      <source src="static/videos_compare_joint/ironman_failure_2.mp4" type="video/mp4">

    
    </video>
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            Iron Man dancing on the beach (only video fine-tuning)
          </div>

<div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_compare_joint/teddy_failure_2.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A teddy bear playing in the water (only video fine-tuning)
          </div>

      </div>

      <!-- 4 column-->
      <div class="column is-multiline">
          <div class="item item-toby">
          <video width="752" height="320" class="clickplay">
      		<source src="static/videos_compare_joint/ironman_failure_3.mp4" type="video/mp4">
    		</video>
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            Iron Man dancing on the beach (only video fine-tuning)
          </div>
		  
		  <div class="item item-toby">
		      <video width="568" height="320" class="clickplay">
		        <source src="static/videos_compare_joint/teddy_failure_3.mp4" type="video/mp4">
		      </video> 
		            </div>
		            <div class="content" style="font-family: Arial; font-style: italic">
		              A teddy bear playing in the water (only video fine-tuning)
		            </div>
		  
      </div>

      </div>
    </div>
  </div>
</div>
</section>


<section class="hero is-light is-small">
  <div class="hero-body">
	  <div class="columns is-centered has-text-centered">
    <div class="container is-max-maxwidth">
		
		<h2 class="title is-3" style="color: red">Comparison (RoPE)</h2>
		<h2 class="title is-5">(Click image to play video)</h2>
		
    <div id="results-carousel" class="carousel results-carousel">

      <!-- 1 column-->
      <div class="column is-multiline">

          <div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_rope/teddy_16.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A teddy bear walking on the street. (16 frames)
          </div>

<div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_rope/panda_16.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A panda taking selfie. (16 frames)
          </div>

      </div>

      <!-- 2 column-->
      <div class="column is-multiline">

          <div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_rope/teddy_32.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A teddy bear walking on the street. (32 frames)
          </div>

<div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_rope/panda_32.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A panda taking selfie. (32 frames)
          </div>

      </div>

      <!-- 3 column-->
      <div class="column is-multiline">

          <div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_rope/teddy_48.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A teddy bear walking on the street. (48 frames)
          </div>

<div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_rope/panda_48.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A panda taking selfie. (48 frames)
          </div>

      </div>

      <!-- 4 column-->
      <div class="column is-multiline">
		  
          <div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_rope/teddy_64.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A teddy bear walking on the street. (64 frames)
          </div>

<div class="item item-toby">
    <video width="568" height="320" class="clickplay">
      <source src="static/videos_rope/panda_64.mp4" type="video/mp4">
    </video> 
          </div>
          <div class="content" style="font-family: Arial; font-style: italic">
            A panda taking selfie. (64 frames)
          </div>
		  
      </div>

      </div>
    </div>
  </div>
</div>
</section>


<section class="section">
  <div class="container is-fullhd">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>
              We present LaVie, a text-to-video (T2V) generation system built on a cascade of video latent diffusion models. LaVie comprises three networks: a base T2V model, a temporal interpolation model, and a video super-resolution model. It addresses the research question of how to extend a pre-trained text-to-image (T2I) model to video synthesis. Our objective is to synthesize visually realistic and temporally coherent videos while preserving the strong compositional nature of the pre-trained T2I model. We find that simple temporal self-attention, coupled with relative positional encoding, adequately captures the temporal correlations inherent in video data. Additionally, we validate that joint image-video fine-tuning plays a pivotal role in producing high-quality results. To enhance the performance of LaVie, we curate Vimeo25M, a comprehensive and diverse video dataset of 25 million text-video pairs that prioritizes quality, diversity, and aesthetic appeal. Experimental evaluations demonstrate that LaVie outperforms state-of-the-art methods in both quantitative and qualitative assessments. Furthermore, we showcase the versatility of pre-trained LaVie models in long video generation and personalized video synthesis.
          </p>
        </div>
      </div>
    </div>
  </div>

</section>


<footer class="footer">
  <div class="container">
    <div class="columns is-centered">
      <div class="column is-8">
        <div class="content">
          <p>
            This website is licensed under a <a rel="license"
                                                href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
            Commons Attribution-ShareAlike 4.0 International License</a>.
          </p>
          <p>
            Website adapted from the following <a rel="license"
                                                href="https://github.com/nerfies/nerfies.github.io">source code</a>.
          </p>
        </div>
      </div>
    </div>
  </div>
</footer>

<script>
  // Start playback when a "clickplay" video is clicked.
  var videos = document.getElementsByClassName("clickplay");
  for (var i = 0; i < videos.length; i++) {
    videos[i].addEventListener("click", function() {
      this.play();
    });
    // When playback ends, rewind so the first frame is shown again.
    videos[i].addEventListener("ended", function() {
      this.pause();
      this.currentTime = 0;
    });
  }
</script>

</body>

</html>
