<!DOCTYPE html>
<html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">

<!-- Begin Jekyll SEO tag v2.8.0 -->
<title>Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation</title>
<meta name="generator" content="Jekyll v3.9.2">
<meta property="og:title" content="Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation">
<meta property="og:locale" content="en_US">
<meta property="og:site_name" content="Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation">
<meta property="og:type" content="website">
<meta name="twitter:card" content="summary">
<meta property="twitter:title" content="Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation">
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"WebSite","headline":"Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation","name":"Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation"}</script>
<!-- End Jekyll SEO tag -->

    <style class="anchorjs"></style><link rel="stylesheet" href="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/style.css">
    <!-- start custom head snippets, customize with your own _includes/head-custom.html file -->

<!-- Setup Google Analytics -->



<!-- You can set your favicon here -->
<!-- link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" -->

<!-- end custom head snippets -->

  </head>
  <body>
    <div class="container-lg px-3 my-5 markdown-body">
      

      <h1 align="center"> Masked Conditional Video Diffusion for <br> Prediction, Generation, and Interpolation </h1>

<p>&nbsp;</p>

<h3 align="center" id=""> <img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/MaskCondVideoDiffFigure.svg" alt="Overview" width="70%"></h3>

<h3 align="center" id="summary"> Summary</h3>

<ul>
  <li>General purpose model for video generation, forward/backward prediction, and interpolation</li>
  <li>Uses a <a href="https://yang-song.github.io/blog/2021/score/">score-based diffusion loss function</a> to generate novel frames</li>
  <li>Injects Gaussian noise into the current frames and denoises them conditional on past and/or future frames</li>
  <li>Randomly <em>masks</em> past and/or future frames during training which allows the model to handle the four cases:
    <ul>
      <li>Unconditional Generation : both past and future are unknown</li>
      <li>Future Prediction : only the past is known</li>
      <li>Past Reconstruction : only the future is known</li>
      <li>Interpolation : both past and future are known</li>
    </ul>
  </li>
  <li>Uses a <a href="https://arxiv.org/abs/2006.11239">2D convolutional U-Net</a> instead of a complex 3D, recurrent, or transformer architecture</li>
  <li>Conditions on past and future frames through concatenation or space-time adaptive normalization</li>
  <li>Produces high-quality and diverse video samples</li>
  <li>Trains with only 1-4 GPUs</li>
  <li>Scales well with the number of channels, and could be scaled much further than in the paper</li>
</ul>
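
<p>As a rough illustration of the noise-injection and concatenation bullets above, the sketch below applies the standard DDPM forward step <code class="language-plaintext highlighter-rouge">x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps</code> to the current frames only, then stacks the clean past and future frames next to them as network input. Plain Python lists stand in for tensors. The linear beta schedule, the function names, and the toy frame sizes are assumptions made for illustration and are not taken from the authors' code.</p>

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_min=1e-4, beta_max=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t, linear schedule (assumed)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_min + (beta_max - beta_min) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def noise_current(current, t, rng):
    """Return (noisy frames, noise) for a list of frames, each a flat list of pixels."""
    a = alpha_bar(t)
    noisy, eps = [], []
    for frame in current:
        e = [rng.gauss(0.0, 1.0) for _ in frame]
        noisy.append([math.sqrt(a) * x + math.sqrt(1.0 - a) * n
                      for x, n in zip(frame, e)])
        eps.append(e)
    return noisy, eps

def concat_condition(past, noisy_current, future):
    """Stack frames along the frame/channel axis; conditioning frames stay clean."""
    return past + noisy_current + future

rng = random.Random(0)
past = [[0.5] * 4 for _ in range(2)]      # 2 clean past frames, 4 pixels each
current = [[0.1] * 4 for _ in range(5)]   # 5 frames to be noised and denoised
future = [[0.9] * 4 for _ in range(2)]    # 2 clean future frames

noisy, eps = noise_current(current, t=500, rng=rng)
net_input = concat_condition(past, noisy, future)
print(len(net_input))  # 9 frames: 2 past + 5 noisy current + 2 future
```

<p>In this sketch a denoising network would receive <code class="language-plaintext highlighter-rouge">net_input</code> and be trained to recover <code class="language-plaintext highlighter-rouge">eps</code> for the current frames only.</p>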

<h3 align="center" id="abstract"> Abstract</h3>

<p style="text-align: justify">Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction – when only future/past frames are masked; unconditional generation – when both past and future frames are masked; and interpolation – when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using &le; 4 GPUs.</p>
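
<p>The masking scheme described in the abstract can be sketched as follows. Each training example independently drops all past frames and/or all future frames, and the combination that remains determines which task the model is practicing. The masking probabilities <code class="language-plaintext highlighter-rouge">p_mask_past</code> and <code class="language-plaintext highlighter-rouge">p_mask_future</code>, the choice of zeroing out masked frames, and all function names are hypothetical; this page does not state the values or the exact mechanism used.</p>

```python
import random

def sample_masks(rng, p_mask_past=0.5, p_mask_future=0.5):
    """Independently decide whether to mask ALL past and ALL future frames."""
    mask_past = p_mask_past > rng.random()
    mask_future = p_mask_future > rng.random()
    return mask_past, mask_future

def task_name(mask_past, mask_future):
    """Map the (past masked, future masked) pair to the corresponding task."""
    if mask_past and mask_future:
        return "unconditional generation"   # nothing is known
    if mask_future:
        return "future prediction"          # only the past is known
    if mask_past:
        return "past reconstruction"        # only the future is known
    return "interpolation"                  # both past and future are known

def apply_masks(past, future, mask_past, mask_future):
    """Replace masked conditioning frames with zeros of the same shape (assumed)."""
    def zeros_like(frames):
        return [[0.0] * len(f) for f in frames]
    return (zeros_like(past) if mask_past else past,
            zeros_like(future) if mask_future else future)

rng = random.Random(0)
past = [[1.0] * 4 for _ in range(2)]
future = [[2.0] * 4 for _ in range(2)]
for _ in range(4):
    mp, mf = sample_masks(rng)
    p, f = apply_masks(past, future, mp, mf)
    print(task_name(mp, mf))
```

<p>Because the two masks are sampled independently, all four cases appear during training and a single set of weights covers every task.</p>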

<p>&nbsp;</p>

<h1 align="center"> Video Prediction </h1>

<p>First, we use real <code class="language-plaintext highlighter-rouge">past</code> frames to predict <code class="language-plaintext highlighter-rouge">current</code> frames. Then, we autoregressively predict the next <code class="language-plaintext highlighter-rouge">current</code> frames using the last predicted frames as the new <code class="language-plaintext highlighter-rouge">past</code> frames (free-running):</p>

<h3 align="center" id="-1"> <img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/autoregressive2.svg" alt="autoregressive" width="50%"> </h3>
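
<p>The free-running loop described above can be sketched as below. <code class="language-plaintext highlighter-rouge">predict_block</code> is a hypothetical stand-in for one full run of the diffusion sampler; here it simply continues a count of integers so that the bookkeeping is visible. With <code class="language-plaintext highlighter-rouge">past</code>=10, <code class="language-plaintext highlighter-rouge">current</code>=5 and <code class="language-plaintext highlighter-rouge">pred</code>=20, the sampler is called 4 times; with <code class="language-plaintext highlighter-rouge">current</code>=5 and <code class="language-plaintext highlighter-rouge">pred</code>=28, it is called 6 times and the last block is truncated.</p>

```python
import math

def predict_block(past, n_current):
    """Hypothetical stand-in for the diffusion sampler: returns n_current new frames.
    Frames are plain integers here, and each 'prediction' continues the count."""
    last = past[-1]
    return [last + i + 1 for i in range(n_current)]

def autoregressive_predict(real_past, n_current, n_pred):
    """Predict n_pred frames in blocks of n_current, feeding predictions back as past."""
    n_past = len(real_past)
    past = list(real_past)
    predicted = []
    n_calls = math.ceil(n_pred / n_current)
    for _ in range(n_calls):
        block = predict_block(past, n_current)
        predicted.extend(block)
        # Slide the window: the newest n_past frames (real or predicted) become the past.
        past = (past + block)[-n_past:]
    return predicted[:n_pred], n_calls

frames, calls = autoregressive_predict(list(range(10)), n_current=5, n_pred=20)
print(calls, frames[0], frames[-1])  # 4 10 29
```

<p>When <code class="language-plaintext highlighter-rouge">past</code> is larger than <code class="language-plaintext highlighter-rouge">current</code>, the conditioning window mixes real and predicted frames for the first few blocks before it consists of predictions only.</p>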

<ul>
  <li><em>left column (with frame number)</em> : real image</li>
  <li><em>right column</em> : predicted image</li>
</ul>

<h3 id="kth-64x64">KTH (64x64)</h3>

<p><code class="language-plaintext highlighter-rouge">past</code>=10, <code class="language-plaintext highlighter-rouge">current</code>=5, autoregressive <code class="language-plaintext highlighter-rouge">pred</code>=20</p>

<p><img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/KTH_big_c10t5_SPADE.gif" alt="KTH_big_c10t5_SPADE" title="KTH pred c10t5"></p>

<p>&nbsp;</p>

<h3 id="bair-64x64">BAIR (64x64)</h3>

<p><code class="language-plaintext highlighter-rouge">past</code>=2, <code class="language-plaintext highlighter-rouge">current</code>=5, autoregressive <code class="language-plaintext highlighter-rouge">pred</code>=28</p>

<p><img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/bair64_big192_5c2_unetm_spade_videos_390000.gif" alt="BAIR_big_c2t5_SPADE" title="BAIR pred c2t5"></p>

<p>&nbsp;</p>

<h3 id="cityscapes-128x128">Cityscapes (128x128)</h3>

<p><code class="language-plaintext highlighter-rouge">past</code>=2, <code class="language-plaintext highlighter-rouge">current</code>=5, autoregressive <code class="language-plaintext highlighter-rouge">pred</code>=28</p>

<p><img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/city32_big192_5c2_unetm_long_75_half.gif" alt="city32_big192_5c2_unetm_long_75_half" title="Cityscapes pred c2t5">
Note that some Cityscapes videos contain brightness changes, which may explain the brightness changes in our generated samples, but the effect is definitely overrepresented in the generated data. More parameters would be needed to fix this problem (beyond what we can achieve with our 4 GPUs).
&nbsp;</p>

<h3 align="center" id="-2"> <img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/Cityscapes_arrow.svg" alt="Cityscapes_arrow"> </h3>

<p>Our approach generates high-quality frames many steps into the future: given the two conditioning frames from the <a href="https://www.cityscapes-dataset.com/">Cityscapes</a> validation set (top left), we show 7 predicted future frames in row 2 below, then skip to frames 20-28, autoregressively predicted in row 4. Ground truth frames are shown in rows 1 and 3. Notice the initial large arrow advancing and passing under the car. At frame 20 (the far left of the 3rd and 4th rows), the initially small and barely visible second arrow in the background of the conditioning frames has advanced into the foreground.</p>

<p>&nbsp;</p>

<h3 id="stochastic-moving-mnist-64x64">Stochastic Moving MNIST (64x64)</h3>

<p><code class="language-plaintext highlighter-rouge">past</code>=5, <code class="language-plaintext highlighter-rouge">current</code>=5, autoregressive <code class="language-plaintext highlighter-rouge">pred</code>=20</p>

<p><img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/SMMNIST_big_c5t5_SPADE_videos_300000.gif" alt="SMMNIST_big_c5t5_SPADE" title="SMMNIST pred c5t5"></p>

<p>In SMMNIST, when two digits overlap for 5 frames, a model conditioning on the 5 previous frames has to guess what those digits were before they overlapped, so the digits may change randomly. Conditioning on a larger number of previous frames would fix this. We used 5 to match previous prediction baselines, which start from 5 frames.</p>

<p>&nbsp;</p>

<h1 align="center"> Video Generation </h1>

<h3 id="kth-64x64-1">KTH (64x64)</h3>

<h3 align="center" id="-3"> <img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/KTH_gen_big_c10t5f5_SPADE_videos_100000.gif" alt="KTH gen c10t5f5"> </h3>

<p>&nbsp;</p>

<h3 id="bair-64x64-1">BAIR (64x64)</h3>

<h3 align="center" id="-4"> <img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/bair64_gen_big192_5c2_pmask50_unetm_spade_videos_400000.gif" alt="BAIR gen c2t5"> </h3>

<p>&nbsp;</p>

<h3 id="stochastic-moving-mnist-64x64-1">Stochastic Moving MNIST (64x64)</h3>

<h3 align="center" id="-5"> <img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/SMMNIST_gen_big_c5t5f5_concat_videos_650000.gif" alt="SMMNIST gen c5t5f5"> </h3>

<p>&nbsp;</p>

<h1 align="center"> Video Interpolation </h1>

<ul>
  <li><em>left column (with frame number)</em> : real image</li>
  <li><em>right column</em> : predicted image</li>
</ul>

<h3 id="kth-64x64-2">KTH (64x64)</h3>

<p><code class="language-plaintext highlighter-rouge">past</code>=10, <strong><code class="language-plaintext highlighter-rouge">interp</code>=10</strong>, <code class="language-plaintext highlighter-rouge">future</code>=5</p>

<p><img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/KTH_interp_big_c10t10f5_SPADE_videos_75000.gif" alt="KTH_interp_big_c10t10f5_SPADE" title="KTH interp c10t10f5"></p>

<p>&nbsp;</p>

<h3 id="bair-64x64-2">BAIR (64x64)</h3>

<p><code class="language-plaintext highlighter-rouge">past</code>=1, <strong><code class="language-plaintext highlighter-rouge">interp</code>=5</strong>, <code class="language-plaintext highlighter-rouge">future</code>=2</p>

<h3 align="center" id="-6"> <img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/BAIR_interp_DDPM_PredPlusInterp_big_c1t5_SPADE_videos_100000.gif" alt="BAIR interp c1t5f2"> </h3>

<p>&nbsp;</p>

<h3 id="stochastic-moving-mnist-64x64-2">Stochastic Moving MNIST (64x64)</h3>

<p><code class="language-plaintext highlighter-rouge">past</code>=5, <strong><code class="language-plaintext highlighter-rouge">interp</code>=5</strong>, <code class="language-plaintext highlighter-rouge">future</code>=5</p>

<p><img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/SMMNIST_interp_big_c5t5f5_SPADE_videos_150000.gif" alt="SMMNIST_interp_big_c5t5_SPADE" title="SMMNIST interp c5t5f5"></p>

<p>&nbsp;</p>

<h2 align="center" id="spatin-architecture"> SPATIN Architecture </h2>

<h3 align="center" id="-7"> <img src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/SPATIN.svg" alt="SPATIN" width="85%"> </h3>
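
<p>The figure above shows the actual SPATIN module. As a loose illustration of the general idea behind adaptive normalization driven by conditioning frames, the sketch below normalizes a feature channel and then rescales and shifts it per pixel using values computed from the conditioning frames. The function names and the weighted sum over frames (standing in for learned layers that mix information across time) are hypothetical simplifications, not the layers in the figure.</p>

```python
import math

def normalize(feature):
    """Zero-mean, unit-variance normalization of one feature channel (flat pixel list)."""
    mean = sum(feature) / len(feature)
    var = sum((v - mean) ** 2 for v in feature) / len(feature)
    return [(v - mean) / math.sqrt(var + 1e-5) for v in feature]

def modulation_from_frames(cond_frames, w_gamma, w_beta):
    """Toy stand-in for learned layers: per-pixel gamma and beta from a weighted
    sum over conditioning frames (one weight per frame, i.e. mixing over time)."""
    n_pix = len(cond_frames[0])
    gamma = [1.0 + sum(w * f[i] for w, f in zip(w_gamma, cond_frames))
             for i in range(n_pix)]
    beta = [sum(w * f[i] for w, f in zip(w_beta, cond_frames))
            for i in range(n_pix)]
    return gamma, beta

def adaptive_norm(feature, cond_frames, w_gamma, w_beta):
    """Normalize the feature, then modulate it per pixel: gamma * normalized + beta."""
    gamma, beta = modulation_from_frames(cond_frames, w_gamma, w_beta)
    return [g * h + b for g, h, b in zip(gamma, normalize(feature), beta)]

feature = [1.0, 2.0, 3.0, 4.0]                       # one channel, 4 pixels
cond = [[0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0]]  # 2 conditioning frames
out = adaptive_norm(feature, cond, w_gamma=[0.1, 0.2], w_beta=[0.0, 0.5])
print(out)
```

<p>With all weights set to zero the output reduces to plain normalization, so the conditioning frames act only through the per-pixel scale and shift.</p>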



      
    </div>
    <script src="./Masked Conditional Video Diffusion for Generation, Prediction, and Interpolation1_files/anchor.min.js" integrity="sha256-lZaRhKri35AyJSypXXs4o6OPFTbTmUoltBbDCbdzegg=" crossorigin="anonymous"></script>
    <script>anchors.add();</script>
  

</body></html>