<link href="https://fonts.cdnfonts.com/css/chalkduster" rel="stylesheet">
<style>
    @import url('https://fonts.cdnfonts.com/css/chalkduster');
</style>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<title>Looking Backward: Streaming Video-to-Video Translation with Feature Banks</title>
<link href="style.css" rel="stylesheet" type="text/css">
</head>

<body>
	<button style="position: fixed;right: 15px;top:  50%;height: 100px;width: 140px; font-size: 20px;" type="button"><a href="supp.html#top">Back to top</a></button> 
<div class="page-container">
  <h1 align="center">Looking Backward: Streaming Video-to-Video Translation with Feature Banks</h1>
  <h2 align="center">Supplementary Material</h2>
	
  <p align="center">&nbsp;</p>
	<a name="top" id="top"></a>
  <ul>
	<li><a href="supp.html#highlight">StreamV2V highlight</a></li>
  <li><a href="supp.html#camera_demo">Camera demo on one 4090TI GPU</a></li>
	<li><a href="supp.html#comparisons_baselines_container">Comparison with state-of-the-art video-to-video methods</a></li>
  <li><a href="supp.html#Ablations">Ablations</a></li>
  <li><a href="supp.html#Continuous_t2i">Continuous image generation with feature bank</a></li>
  <li><a href="supp.html#long_video">Long video (>1000 frames) translation</a></li>
  <li><a href="supp.html#Limitations">Limitations</a></li>
  </ul>
  <p><br><span class="emph">All videos are compressed. We recommend watching them in full screen. Click a video to view it at full scale.</span></p>
	
  <!------------------ BEGIN SECTION ------------------>

  <p>&nbsp;</p>
  <hr>

  <h2 id="highlight" align="left"><a name="image-results" id="image-results"></a>StreamV2V highlight</h2>
  <section class="hero teaser">
    <div class="container is-max-desktop">
      <div class="hero-body">
        <video poster="" id="tree" autoplay controls loop width="100%">
          <source src="./assets/streamv2v_video_downsize.mp4" type="video/mp4">
        </video>
      </div>
    </div>
  </section>

  <p>&nbsp;</p>
  <hr>


  <h2 id="camera_demo" align="left"><a name="image-results" id="image-results"></a>Camera demo on one 4090TI GPU</h2>
  <section class="hero teaser">
    <div class="container is-max-desktop">
      <div class="hero-body">
        <video poster="" id="tree" autoplay controls loop width="100%">
          <source src="./assets/camera_demo_w_mosic_downsize.mp4" type="video/mp4">
        </video>
      </div>
    </div>
  </section>

    <!------------------ END SECTION ------------------>
    
  <!------------------ BEGIN SECTION ------------------>
  <p>&nbsp;</p>
  <hr>
	
  <h2 id="comparisons_baselines_container" align="left"><a name="image-results" id="image-results"></a>Comparison with state-of-the-art video-to-video methods</h2>
  <p align="left">We compare StreamV2V with the following methods:</p>
    <!-- Existing methods of text-guided video editing suffer from temporal inconsistency. -->
    <ul>
        <li>StreamDiffusion (<a href="supp.html#ref-streamdiffusion">[1]</a>)</li>
        <li>CoDeF (<a href="supp.html#ref-codef">[2]</a>)</li>
        <li>Rerender-a-Video (<a href="supp.html#ref-rerender">[3]</a>)</li>
        <li>FlowVid (<a href="supp.html#ref-flowvid">[4]</a>)</li>
    </ul>

  <table  width="100%" align="center">

      <tr>
          <th style="font-size: 20px" colspan="3">(Fig.5) Edit prompt: A <span style="color: red;">pixel art</span> of a man doing a handstand on the street.</th>
      </tr>
        <tr>
            <th style="font-size: 16px">Input video</th>
            <th style="font-size: 16px">Ours</th>
            <th style="font-size: 16px">StreamDiffusion (<a href="supp.html#ref-streamdiffusion">[1]</a>)</th>
        </tr>
        <tr>                
          <th><video width="400" src="assets/comparsion/breakdance-flare.mp4" autoplay loop controls muted></video></th>
          <th><video width="400" src="assets/comparsion/streamv2v-breakdance-flare_pixelart_0.mp4" autoplay loop controls muted></video></th>
          <th><video width="400" src="assets/comparsion/streamdiffusion-breakdance-flare_pixelart_0.mp4" autoplay loop controls muted></video></th>
          </tr>
          <tr>
            <th style="font-size: 16px">CoDeF (<a href="supp.html#ref-codef">[2]</a>)</th>
            <th style="font-size: 16px">Rerender (<a href="supp.html#ref-rerender">[3]</a>)</th>
            <th style="font-size: 16px">FlowVid (<a href="supp.html#ref-flowvid">[4]</a>)</th>
        </tr>
          <tr>
            <th style="padding-bottom: 30px;"><video width="400" src="assets/comparsion/codef-breakdance-flare_pixelart.mp4" autoplay loop controls muted></video></th>
            <th style="padding-bottom: 30px;"><video width="400" src="assets/comparsion/rerender-breakdance-flare_pixelart_1.mp4" autoplay loop controls muted></video></th>
            <th style="padding-bottom: 30px;"><video width="400" src="assets/comparsion/flowvid-breakdance-flare_pixelart_0.mp4" autoplay loop controls muted></video></th>
        </tr>

</table>
<!------------------ END SECTION ------------------>

<!------------------ BEGIN SECTION ------------------>
<p>&nbsp;</p>
<hr>
  
<h2 id="Ablations" align="left"><a name="image-results" id="image-results"></a>Ablations</h2>
<table width="100%" align="center">
  <tbody>
    <tr>                
      <th style="font-size: 18px; padding-bottom: 20px;">(Fig.8) Ablation on Extended self-Attention (EA) and Feature Fusion (FF). Edit prompt: "A man is surfing, in animation".</th>
    </tr>
    <tr>                
      <th style="padding-bottom: 25px;"><video width="100%" src="assets/ablations/EA_FF.mp4" autoplay loop controls muted></video></th>
    </tr>
    <tr> 
      <td style="height: 50px;"></td> <!-- Adjust the height here to increase or decrease the space -->
    </tr>
  </tbody>
</table>

<table  width="100%" align="center">
  <tbody>
    <tr>                
      <th style="font-size: 18px; padding-bottom: 10px;">(Fig.13) Ablation on the number of denoising steps. Fewer denoising steps reduce per-frame inference time, but we observe some quality drop when using only 1 step.</th>
      </tr>
      <tr>                
        <th style="font-size: 16px; padding-bottom: 0px;">Elon Musk</th>
        </tr>
    <tr>                
      <th style="padding-bottom: 20px;"><video width="100%" src="assets/ablations/ElonMusk_steps.mp4" autoplay loop controls muted></video></th>
      </tr>
      <tr>                
        <th style="font-size: 16px; padding-bottom: 10px;">Claymation</th>
        </tr>
      <tr>               
        <th><video width="100%" src="assets/ablations/Claymation_steps.mp4" autoplay loop controls muted></video></th>
        </tr>

  </tbody>
</table> 

<!------------------ END SECTION ------------------>	


<p>&nbsp;</p>
<hr>


<h2 id="Continuous_t2i" align="left"><a name="image-results" id="image-results"></a>Continuous image generation with feature bank</h2>
<section class="hero teaser">
  <div class="container is-max-desktop">
    <div class="hero-body">
      <video poster="" id="tree" autoplay controls loop width="100%">
        <source src="./assets/continuous_t2i.mp4" type="video/mp4">
      </video>
    </div>
  </div>
</section>



<p>&nbsp;</p>
<hr>


<h2 id="long_video" align="left"><a name="image-results" id="image-results"></a>Long video (>1000 frames) translation</h2>
<section class="hero teaser">
  <div class="container is-max-desktop">
    <div class="hero-body">
      <video poster="" id="tree" autoplay controls loop width="100%">
        <source src="./assets/half_minute_video.mp4" type="video/mp4">
      </video>
    </div>
  </div>
</section>


<!------------------ BEGIN SECTION ------------------>
<p>&nbsp;</p>
<hr>
  
<h2 id="Limitations" align="left"><a name="image-results" id="image-results"></a>Limitations</h2>
<br/>

<table  width="100%" align="center">
  <tbody>
    <tr>                
      <th style="font-size: 18px; padding-bottom: 10px;">(Fig.9) Limitations of StreamV2V.</th>
      </tr>
      <tr>                
        <th style="font-size: 16px; padding-bottom: 0px;">(a) StreamV2V fails to transform the person in the input video into the Pope or Batman.</th>
        </tr>
    <tr>                
      <th style="padding-bottom: 20px;"><video width="100%" src="assets/limitations/limitation_selfie.mp4" autoplay loop controls muted></video></th>
      </tr>
      <tr>                
        <th style="font-size: 16px; padding-bottom: 10px;">(b) StreamV2V can produce inconsistent output, as seen in the girl for the Anime style and the backpack straps for the Van Gogh style.</th>
        </tr>
      <tr>             
        <th><video width="100%" src="assets/limitations/limitation_walking.mp4" autoplay loop controls muted></video></th>
        </tr>

  </tbody>
</table> 
<!------------------ END SECTION ------------------>	



  <p><br>
  </p>
  <p>&nbsp;</p>
  <p>&nbsp;</p>
  <p>&nbsp;</p>


  <p>
    <a name="ref-streamdiffusion" id="ref-streamdiffusion"></a>
    [1] Kodaira, Akio, et al. "StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation." arXiv preprint arXiv:2312.12491 (2023).
  </p>
  <p>
    <a name="ref-codef" id="ref-codef"></a>
    [2] Ouyang, Hao, et al. "CoDeF: Content Deformation Fields for Temporally Consistent Video Processing." arXiv preprint arXiv:2308.07926 (2023).
  </p>
  <p>
    <a name="ref-rerender" id="ref-rerender"></a>
    [3] Yang, Shuai, et al. "Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation." (2023).
  </p>
  <p>
    <a name="ref-flowvid" id="ref-flowvid"></a>
    [4] Liang, Feng, et al. "FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis." arXiv preprint arXiv:2312.17681 (2023).
  </p>
</div>

</body></html>