<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization">
  <meta name="keywords" content="video,video generation,training-free,structural noise">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Video-MSG</title>

  <!-- Global site tag (gtag.js) - Google Analytics -->
  <!-- <script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script> -->
  <script>
    window.dataLayer = window.dataLayer || [];

    function gtag() {
      dataLayer.push(arguments);
    }

    gtag('js', new Date());

    gtag('config', 'G-PYVRSFMDRL');
  </script>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <!-- <link rel="icon" href="./static/images/favicon.svg"> -->

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>

<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">Training-free Guidance in Text-to-Video Generation <br> via Multimodal Planning and Structured Noise Initialization</h1>

        </div>
      </div>
    </div>
  </div>
</section>



<section class="section">
  <div class="container is-max-desktop">
    <!-- Abstract. -->
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <center>
          <img src="./static/images/method.png" alt="Overview of the Video-MSG method" width="90%">
        </center>
        <br>
        <h2 class="title is-3">Abstract</h2>

        <img src="./static/images/teaser.png" alt="Teaser" width="80%">

        <div class="content has-text-justified">

          <p>
          Recent advances in text-to-video (T2V) diffusion models have significantly improved the visual quality of generated videos. However, even recent T2V models struggle to follow text descriptions accurately, especially when the prompt requires precise control of spatial layouts or object trajectories. A recent line of research adds layout guidance to T2V models, but these methods require fine-tuning or iterative manipulation of the attention map at inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps. In the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video that specifies the background, foreground, and object trajectories in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG needs neither fine-tuning nor attention manipulation with additional memory at inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2V-CompBench and VBench). We provide comprehensive ablation studies on the noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
          </p>
        </div>
      </div>
    </div>
  </div>
</section>



<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Method</h2>

        <img src="./static/images/method.png" alt="Overview of the Video-MSG method" width="80%">
        <div class="content has-text-justified">
          <p>We introduce <b>Video-MSG</b>, <b>M</b>ultimodal <b>S</b>ketch <b>G</b>uidance for video generation, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization. Video-MSG consists of three stages (illustrated in the figure above; Fig. 2 in the paper):</p>
          <ol>
            <li><b>Background planning</b>, where we adopt T2I and I2V models to generate background image priors with natural animation.</li>
            <li><b>Foreground object layout and trajectory planning</b>, where we apply a multimodal LLM (MLLM) and object detectors to plan foreground objects and place them harmoniously into the background.</li>
            <li><b>Video generation with structured noise initialization</b>, where the synthesized images from the stages above serve as the Video Sketch that guides final video generation via inversion techniques.</li>
          </ol>
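          <p>The third stage can be illustrated with a minimal Python sketch. It assumes an SDEdit-style forward noising step with a linear DDPM beta schedule, which may differ from the inversion actually used by Video-MSG. The function names (<code>alpha_bar_schedule</code>, <code>structured_noise_init</code>), the schedule constants, the latent shape, and the mapping from inversion ratio to timestep are all hypothetical choices made for illustration.</p>
<pre>
```python
import numpy as np


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def structured_noise_init(sketch_latents, inversion_ratio, alpha_bar, rng):
    """Noise the Video Sketch latents up to an intermediate timestep.

    Returns (noisy_latents, start_step). A sampler would begin denoising at
    start_step instead of starting from pure Gaussian noise.
    """
    # Keep the ratio inside [0, 1]; 0 keeps the sketch almost intact, 1 is near pure noise.
    ratio = float(np.clip(inversion_ratio, 0.0, 1.0))
    # Hypothetical mapping from ratio to a discrete timestep index.
    start_step = int(round(ratio * (len(alpha_bar) - 1)))
    a = alpha_bar[start_step]
    noise = rng.standard_normal(sketch_latents.shape)
    # Standard forward diffusion: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.
    noisy = np.sqrt(a) * sketch_latents + np.sqrt(1.0 - a) * noise
    return noisy, start_step


rng = np.random.default_rng(0)
alpha_bar = alpha_bar_schedule()
# Placeholder latents with shape (frames, channels, height, width).
sketch = rng.standard_normal((8, 4, 16, 16))
noisy, start_step = structured_noise_init(sketch, 0.8, alpha_bar, rng)
```
</pre>
          <p>Under these assumptions, a larger inversion ratio injects more noise, so the backbone has more freedom to repair the sketch but follows its layout less strictly. A smaller ratio preserves the planned layout and trajectories more faithfully.</p>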

        </div>

      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Qualitative Examples</h2>

        <div class="content is-centered has-text-centered">

          <div class="content example">
            <p class="example-prompt">Motion: An egg rolling from the right to left on the table.</p>
            <div class="example-gifs">
                <div>
                  <p><b>CogVideoX-5B</b></p>
                  <img src="./static/images/qualitative_examples/baseline_videos/motion_1.gif" alt="CogVideoX-5B result for the egg-rolling prompt" width="80%">
                </div>

                <div>
                  <p><b>Ours (Video Sketch)</b></p>
                  <img src="./static/images/qualitative_examples/video_sketch/0068.gif" alt="Video Sketch for the egg-rolling prompt" width="80%">
                </div>

                <div>
                  <p><b>Ours (Final Video)</b></p>
                  <img src="./static/images/qualitative_examples/ours_videos/motion_1.gif" alt="Video-MSG final video for the egg-rolling prompt" width="80%">
                </div>
            </div>
            <br>
          </div>
          <div class="content example">
            <p class="example-prompt">Numeracy: Six penguins waddle together across an icy landscape.</p>
            <div class="example-gifs">
                <div>
                  <p><b>CogVideoX-5B</b></p>
                  <img src="./static/images/qualitative_examples/baseline_videos/numeracy_2.gif" alt="CogVideoX-5B result for the six-penguins prompt" width="80%">
                </div>

                <div>
                  <p><b>Ours (Video Sketch)</b></p>
                  <img src="./static/images/qualitative_examples/video_sketch/0080.gif" alt="Video Sketch for the six-penguins prompt" width="80%">
                </div>

                <div>
                  <p><b>Ours (Final Video)</b></p>
                  <img src="./static/images/qualitative_examples/ours_videos/numeracy_2.gif" alt="Video-MSG final video for the six-penguins prompt" width="80%">
                </div>
            </div>
            <br>
          </div>
          <div class="content example">
            <p class="example-prompt">Spatial: A gorilla sitting on the left side of a vending machine in a forest.</p>
            <div class="example-gifs">
                <div>
                  <p><b>CogVideoX-5B</b></p>
                  <img src="./static/images/qualitative_examples/baseline_videos/spatial_1.gif" alt="CogVideoX-5B result for the gorilla and vending machine prompt" width="80%">
                </div>

                <div>
                  <p><b>Ours (Video Sketch)</b></p>
                  <img src="./static/images/qualitative_examples/video_sketch/0011.gif" alt="Video Sketch for the gorilla and vending machine prompt" width="80%">
                </div>

                <div>
                  <p><b>Ours (Final Video)</b></p>
                  <img src="./static/images/qualitative_examples/ours_videos/spatial_1.gif" alt="Video-MSG final video for the gorilla and vending machine prompt" width="80%">
                </div>
            </div>
            <br>
          </div>
          <div class="content example">
            <p class="example-prompt">Spatial: A child building a sandcastle on the right of a beach umbrella.</p>
            <div class="example-gifs">
                <div>
                  <p><b>CogVideoX-5B</b></p>
                  <img src="./static/images/qualitative_examples/baseline_videos/spatial_2.gif" alt="CogVideoX-5B result for the sandcastle and beach umbrella prompt" width="80%">
                </div>

                <div>
                  <p><b>Ours (Video Sketch)</b></p>
                  <img src="./static/images/qualitative_examples/video_sketch/0021.gif" alt="Video Sketch for the sandcastle and beach umbrella prompt" width="80%">
                </div>

                <div>
                  <p><b>Ours (Final Video)</b></p>
                  <img src="./static/images/qualitative_examples/ours_videos/spatial_2.gif" alt="Video-MSG final video for the sandcastle and beach umbrella prompt" width="80%">
                </div>
            </div>
            <br>
          </div>



        </div>

      </div>
    </div>
  </div>
</section>


<section class="section">
  <div class="container is-max-desktop">
    <!-- Quantitative results. -->
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Quantitative Results</h2>

        <center>
          <img src="./static/images/result.png" alt="Quantitative results" width="80%">
        </center>

      </div>
    </div>
  </div>
</section>



<footer class="footer">
  <div class="container">
    <div class="columns is-centered">
      <div class="column is-8">
        <div class="content">
          The webpage was adapted from <a href="https://github.com/nerfies/nerfies.github.io">nerfies</a>.
        </div>
      </div>
    </div>
  </div>
</footer>

</body>
</html>
