<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="VideoDirectorGPT: Consistent Multi-Scene Video Generation via LLM-Guided Planning">
  <meta name="keywords" content="video,video generation,gpt,video director,video gpt">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>VideoDirectorGPT</title>

  <!-- Global site tag (gtag.js) - Google Analytics -->
  <!-- <script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script> -->
  <script>
    window.dataLayer = window.dataLayer || [];

    function gtag() {
      dataLayer.push(arguments);
    }

    gtag('js', new Date());

    gtag('config', 'G-PYVRSFMDRL');
  </script>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <!-- <link rel="icon" href="./static/images/favicon.svg"> -->

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>

<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">VideoDirectorGPT: Consistent Multi-Scene<br>Video Generation via LLM-Guided Planning</h1>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Generated Examples</h2>
        <div class="content has-text-justified">
          <p>
            We provide the full video samples described in our paper.
            For <b>single-scene video generation</b>, we evaluate layout control via VPEval Skill-based prompts,
            assess object dynamics through ActionBench-Direction prompts adapted from
            ActionBench-SSV2, and examine open-domain video generation using the MSR-VTT dataset. 
            For <b>multi-scene video generation</b>, we experiment with two types of input prompts:
            (1) a list of sentences describing events (ActivityNet Captions
            and Coref-SV prompts based on Pororo-SV), and (2) a single sentence from which models
            generate multi-scene videos (HiREST). In addition, we show videos generated with custom entities
            given as text only (using Karlo) or as image+text with <b>user-provided images</b>, as well as <b>human-in-the-loop editing</b>.
          </p>      
        </div>
        <div class="content is-centered has-text-centered">
          
          <h3>Coref-SV</h3>
          <div class="content example">
            <div class="example-prompt" style="text-align: left; width: 80%;margin:auto">
              Scene 1: <b style="color:red;">mouse</b> is holding a book and makes a happy face.<br>
              Scene 2: <b style="color:red;">he</b> looks happy and talks.<br>
              Scene 3: <b style="color:red;">he</b> is pulling petals off the flower.<br>
              Scene 4: <b style="color:red;">he</b> is ripping a petal from the flower.<br>
              Scene 5: <b style="color:red;">he</b> is holding a flower by <b style="color:red;">his</b> right paw.<br>
              Scene 6: one paw pulls the last petal off the flower.<br>
              Scene 7: <b style="color:red;">he</b> is smiling and talking while holding a flower on <b style="color:red;">his</b> right paw.
            </div>
            <br>
            <div class="example-gifs">
                <div>
                  <p><b>ModelScopeT2V</b></p>
                  <img src="./static/images/coref_modelscope_1.gif" alt="ModelScopeT2V result for the Coref-SV mouse prompt" width="80%">
                </div>
                <div>
                  <p><b>VideoDirectorGPT (Ours)</b></p>
                  <img src="./static/images/coref_ours_1.gif" alt="VideoDirectorGPT result for the Coref-SV mouse prompt, with object layouts overlaid" width="80%">
                </div>
            </div>
            <br>
            <p>Video generation examples on a Coref-SV prompt. Our video plan's
              object layouts (overlaid) can guide the Layout2Vid module to generate the same mouse
              and flower consistently across scenes, whereas ModelScopeT2V loses track of the mouse
              right after the first scene, generating a human hand and a dog instead, and
              changes the color of the flower.</p>
          </div>
          <div class="content example">
            <div class="example-prompt" style="text-align: left; width: 80%;margin:auto">
              Scene 1: it's snowing outside.<br>
              Scene 2: <b style="color:red;">dog</b> is singing and dancing.<br>
              Scene 3: <b style="color:red;">its</b> friends are encouraging <b style="color:red;">it</b> to do something.<br>
              Scene 4: <b style="color:red;">its</b> friends are applauding at <b style="color:red;">it</b>.<br>
              Scene 5: <b style="color:red;">it</b> is bowing to the audience after the performance.
            </div>
            <br>
            <div class="example-gifs">
                <div>
                  <p><b>ModelScopeT2V</b></p>
                  <img src="./static/images/coref_modelscope_2.gif" alt="ModelScopeT2V result for the Coref-SV dog prompt" width="80%">
                </div>
                <div>
                  <p><b>VideoDirectorGPT (Ours)</b></p>
                  <img src="./static/images/coref_ours_2.gif" alt="VideoDirectorGPT result for the Coref-SV dog prompt, with object layouts overlaid" width="80%">
                </div>
            </div>
            <br>
          <p>Video generation examples on a Coref-SV prompt.
            Our video plan's object layouts (overlaid) can guide the Layout2Vid module to
            generate the same brown dog and maintain snow across scenes consistently,
            whereas ModelScopeT2V generates different dogs in different scenes and loses
            the snow after the first scene.</p>
          </div>
          <hr>
          <h3>HiREST</h3>
          <div class="content example">
            <p class="example-prompt">make caraway cakes</p>
            <div class="example-gifs">
                <div>
                  <p><b>ModelScopeT2V</b></p>
                  <img src="./static/images/hirest_modelscope_1.gif" alt="ModelScopeT2V result for the HiREST prompt: make caraway cakes" width="80%">
                </div>
                <div>
                  <p><b>VideoDirectorGPT (Ours)</b></p>
                  <img src="./static/images/hirest_ours_1.gif" alt="VideoDirectorGPT result for the HiREST prompt: make caraway cakes" width="80%">
                </div>
            </div>
            <br>
            <p>Comparison of generated videos on a HiREST prompt.
              Our model is able to generate a detailed video plan that properly expands
              the original text prompt to show the process, has accurate object bounding
              box locations (overlaid), and maintains the consistency of the person across
              the scenes. ModelScopeT2V generates only the final caraway cake, and that cake
              is not consistent across scenes.</p>
          </div>
          <div class="content example">
            <p class="example-prompt">make a strawberry surprise</p>
            <div class="example-gifs">
                <div>
                  <p><b>ModelScopeT2V</b></p>
                  <img src="./static/images/hirest_modelscope_2.gif" alt="ModelScopeT2V result for the HiREST prompt: make a strawberry surprise" width="80%">
                </div>
                <div>
                  <p><b>VideoDirectorGPT (Ours)</b></p>
                  <img src="./static/images/hirest_ours_2.gif" alt="VideoDirectorGPT result for the HiREST prompt: make a strawberry surprise" width="80%">
                </div>
            </div>
            <br>
          <p>Comparison of generated videos on a HiREST prompt.
            Our VideoDirectorGPT generates a detailed video plan that properly expands the original text prompt,
            ensures accurate object bounding box locations (overlaid), and maintains the consistency of the
            person across the scenes. ModelScopeT2V generates only the strawberries.</p>
          </div>
          <hr>
          <h3>ActionBench-Direction prompts</h3>
          <div class="content example">
            <p class="example-prompt">pushing <b style="color:red;">stuffed animal</b> from <b>left to right</b></p>
            <div class="example-gifs">
                <div>
                  <p><b>ModelScopeT2V</b></p>
                  <img src="./static/images/action_modelscope_1.gif" alt="ModelScopeT2V result for the prompt: pushing stuffed animal from left to right" width="80%">
                </div>
                <div>
                  <p><b>VideoDirectorGPT (Ours)</b></p>
                  <img src="./static/images/action_ours_1.gif" alt="VideoDirectorGPT result for the prompt: pushing stuffed animal from left to right" width="80%">
                </div>
            </div>
          </div>
          <div class="content example">
            <p class="example-prompt">pushing <b style="color:red;">pear</b> from <b>right to left</b></p>
            <div class="example-gifs">
                <div>
                  <p><b>ModelScopeT2V</b></p>
                  <img src="./static/images/action_modelscope_2.gif" alt="ModelScopeT2V result for the prompt: pushing pear from right to left" width="80%">
                </div>
                <div>
                  <p><b>VideoDirectorGPT (Ours)</b></p>
                  <img src="./static/images/action_ours_2.gif" alt="VideoDirectorGPT result for the prompt: pushing pear from right to left" width="80%">
                </div>
            </div>
            <br>
          <p>Video generation examples on ActionBench-Direction prompts.
            Our video plan's object layouts (overlaid) can guide the Layout2Vid module to place
            and move the 'stuffed animal' and the 'pear' in the directions given in their prompts,
            whereas the objects in the ModelScopeT2V videos stay in the same location or move
            in random directions.</p>
          </div>
          <hr>
          <h3>VPEval Skill-based prompts</h3>
          <div class="content example">
            <p class="example-prompt">a <b style="color:green;">pizza</b> is to the <b>left</b> of an <b style="color:red;">elephant</b></p>
            <div class="example-gifs">
                <div>
                  <p><b>ModelScopeT2V</b></p>
                  <img src="./static/images/vpeval_modelscope_1.gif" alt="ModelScopeT2V result for the prompt: a pizza is to the left of an elephant" width="80%">
                </div>
                <div>
                  <p><b>VideoDirectorGPT (Ours)</b></p>
                  <img src="./static/images/vpeval_ours_1.gif" alt="VideoDirectorGPT result for the prompt: a pizza is to the left of an elephant" width="80%">
                </div>
            </div>
          </div>
          <div class="content example">
            <p class="example-prompt"><b>four</b> frisbees</p>
            <div class="example-gifs">
                <div>
                  <p><b>ModelScopeT2V</b></p>
                  <img src="./static/images/vpeval_modelscope_2.gif" alt="ModelScopeT2V result for the prompt: four frisbees" width="80%">
                </div>
                <div>
                  <p><b>VideoDirectorGPT (Ours)</b></p>
                  <img src="./static/images/vpeval_ours_2.gif" alt="VideoDirectorGPT result for the prompt: four frisbees" width="80%">
                </div>
            </div>
            <br>
            <p>Video generation examples on VPEval Skill-based prompts for spatial and count skills.
              Our video plan, with object layouts overlaid, successfully guides the Layout2Vid module to place objects
              in the correct spatial relations and to depict the correct number of objects,
              whereas ModelScopeT2V fails to generate 'pizza' in the first example and
              generates too many frisbees in the second example.</p>
          </div>
          <hr>
          <h3>User-Provided Input Image &rarr; Video</h3>
          <div class="content example">
            <div class="example-prompt" style="text-align: left; width: 85%;margin:auto">
              Scene 1: a &lt;S&gt; then gets up from a plush beige bed.<br>
              Scene 2: a &lt;S&gt; goes to the cream-colored kitchen and eats a can of gourmet cat snack.<br>
              Scene 3: a &lt;S&gt; sits next to a large floor-to-ceiling window.
            </div>
            <br>
            <div class="example-gifs">
                <div style="display: block;position: relative;">
                  <p><b>Input</b></p>
                  <div style="font-size: 1.3em;position: relative;top: 50%;transform: translateY(-50%);">&lt;S&gt; = "white cat"</div>
                </div>
                <img src="./static/images/arrow_icon.jpg" style="height:50px;position: absolute;top: 50%;transform: translateY(-50%);">
                <div>
                  <p><b>Generated GIF</b></p>
                  <img src="./static/images/exemplar/white_cat.gif" style="position: relative;top: 50%;transform: translateY(-50%);" alt="Generated video of a white cat from a text-only entity description" width="80%">
                </div>
            </div>
            <br><br>
            <div class="example-gifs">
                <div style="display: block;position: relative;text-align: center;">
                  <p><b>Input</b></p>
                  <div style="font-size: 1.3em;">&lt;S&gt; = "cat"</div>
                  <div>+</div>
                  <div style="display: flex;flex-wrap: wrap;justify-content: center;">
                    <img class="example-input" src="./static/images/exemplar/cat1.png">
                    <img class="example-input" src="./static/images/exemplar/cat2.png">
                    <img class="example-input" src="./static/images/exemplar/cat3.png">
                    <img class="example-input" src="./static/images/exemplar/cat4.png">
                  </div>
                </div>
                <img src="./static/images/arrow_icon.jpg" style="height:50px;position: absolute;top: 50%;transform: translateY(-50%);">
                <div>
                  <p><b>Generated GIF</b></p>
                  <img src="./static/images/exemplar/cat.gif" style="position: relative;top: 50%;transform: translateY(-50%);" alt="Generated video of the cat shown in the user-provided images" width="80%">
                </div>
            </div>
            <br><br>
            <div class="example-gifs">
              <div style="display: block;position: relative;text-align: center;">
                <p><b>Input</b></p>
                <div style="font-size: 1.3em;">&lt;S&gt; = "cat"</div>
                <div>+</div>
                <div style="display: flex;flex-wrap: wrap;justify-content: center;">
                  <img class="example-input" src="./static/images/exemplar/siamese cat1.png">
                  <img class="example-input" src="./static/images/exemplar/siamese cat2.png">
                  <img class="example-input" src="./static/images/exemplar/siamese cat3.png">
                  <img class="example-input" src="./static/images/exemplar/siamese cat4.png">
                </div>
              </div>
              <img src="./static/images/arrow_icon.jpg" style="height:50px;position: absolute;top: 50%;transform: translateY(-50%);">
              <div>
                <p><b>Generated GIF</b></p>
                <img src="./static/images/exemplar/siamese_cat.gif" style="position: relative;top: 50%;transform: translateY(-50%);" alt="Generated video of the Siamese cat shown in the user-provided images" width="80%">
              </div>
          </div>
          <br><br>
          <div class="example-gifs">
            <div style="display: block;position: relative;text-align: center;">
              <p><b>Input</b></p>
              <div style="font-size: 1.3em;">&lt;S&gt; = "teddy bear"</div>
              <div>+</div>
              <div style="display: flex;flex-wrap: wrap;justify-content: center;">
                <img class="example-input" src="./static/images/exemplar/teddy bear1.png">
                <img class="example-input" src="./static/images/exemplar/teddy bear2.png">
                <img class="example-input" src="./static/images/exemplar/teddy bear3.png">
                <img class="example-input" src="./static/images/exemplar/teddy bear4.png">
              </div>
            </div>
            <img src="./static/images/arrow_icon.jpg" style="height:50px;position: absolute;top: 50%;transform: translateY(-50%);">
            <div>
              <p><b>Generated GIF</b></p>
              <img src="./static/images/exemplar/teddy_bear.gif" style="position: relative;top: 50%;transform: translateY(-50%);" alt="Generated video of the teddy bear shown in the user-provided images" width="80%">
            </div>
        </div>
            <br>
            <p>Video generation examples with custom entities.
              Users can flexibly provide either text-only (1st row) or image+text (2nd to 4th rows)
              descriptions to place custom entities when generating videos with VideoDirectorGPT.
              For both the text-only and the image+text entity grounding examples, the identities of the
              provided entities are well preserved across multiple scenes.</p>
          </div>
          
          <hr>
          <h3>Human-in-the-Loop Editing</h3>
          <div class="content example">
            <br><br>
            <div class="example-gifs">
              <div>
                <p><b>Original GIF</b></p>
                <img src="./static/images/Original.gif" style="position: relative;top: 49%;transform: translateY(-50%);" alt="Original generated video for the prompt: A horse running" width="80%">
              </div>
              <img src="./static/images/arrow_icon.jpg" style="height:50px;position: absolute;top: 50%;transform: translateY(-50%);">
              <div style="display: block;position: relative;text-align: center;">
                <p><b>Human Edit</b></p>
                <div style="font-size: 1.3em;">Make the horse smaller</div>
                <img src="./static/images/Edit1.gif" alt="Edited video with a smaller horse" width="80%">
              </div>
            </div>
            <br><br>
            <div class="example-gifs">
              <div>
                <p><b>Original GIF</b></p>
                <img src="./static/images/Original.gif" style="position: relative;top: 49%;transform: translateY(-50%);" alt="Original generated video for the prompt: A horse running" width="80%">
              </div>
              <img src="./static/images/arrow_icon.jpg" style="height:50px;position: absolute;top: 50%;transform: translateY(-50%);">
              <div style="display: block;position: relative;text-align: center;">
                <p><b>Human Edit</b></p>
                <div style="font-size: 1.3em;">Add "grassland" background</div>
                <img src="./static/images/Edit2.gif" alt="Edited video with a grassland background" width="80%">
              </div>
            </div>
            <br><br>
            <div class="example-gifs">
              <div>
                <p><b>Original GIF</b></p>
                <img src="./static/images/Original.gif" style="position: relative;top: 49%;transform: translateY(-50%);" alt="Original generated video for the prompt: A horse running" width="80%">
              </div>
              <img src="./static/images/arrow_icon.jpg" style="height:50px;position: absolute;top: 50%;transform: translateY(-50%);">
              <div style="display: block;position: relative;text-align: center;">
                <p><b>Human Edit</b></p>
                <div style="font-size: 1.3em;">Add "night street" background</div>
                <img src="./static/images/Edit3.gif" alt="Edited video with a night street background" width="80%">
              </div>
            </div>
            <br>
            <p>Video generation examples for human-in-the-loop editing.
              Users can modify the video plan (e.g., add or delete objects, or change the background and entity layouts)
              to generate customized video content.
              Given the same text prompt "A horse running", we provide visualizations with a smaller horse and
              different backgrounds (i.e., "night street" and "grassland").</p>
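            <p>As a rough sketch of what editing a video plan could look like, the JSON fragment below shows a
              hypothetical, simplified plan for "A horse running" together with two edited copies. The field names
              (<code>background</code>, <code>entities</code>, <code>box</code>), the single-box layout, and all
              coordinate values are assumptions made for illustration, not the plan format produced by
              VideoDirectorGPT; boxes are assumed to be normalized <code>[x0, y0, x1, y1]</code> corners.</p>
            <pre style="text-align: left;"><code>{
  "original": {
    "prompt": "A horse running",
    "background": null,
    "entities": [{ "name": "horse", "box": [0.10, 0.20, 0.90, 0.90] }]
  },
  "edit_smaller_horse": {
    "prompt": "A horse running",
    "background": null,
    "entities": [{ "name": "horse", "box": [0.30, 0.45, 0.70, 0.80] }]
  },
  "edit_night_street": {
    "prompt": "A horse running",
    "background": "night street",
    "entities": [{ "name": "horse", "box": [0.10, 0.20, 0.90, 0.90] }]
  }
}</code></pre>
            <p>Under these assumptions, shrinking the box would correspond to the "smaller horse" edit, and filling in
              the background field would correspond to a background edit that leaves the entity layout unchanged.</p>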
          </div>
        </div>

      </div>
    </div>
  </div>
</section>

</body>
</html>
