<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <!-- Meta tags for social media banners, these should be filled in appropriately as they are your "business card" -->
  <!-- Replace the content tag with appropriate information -->
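  <meta name="description" content="WorldWeaver is a framework for long-horizon video generation that jointly models RGB frames and perceptual conditions such as depth and optical flow.">
  <meta property="og:title" content="WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception"/>
  <meta property="og:description" content="WorldWeaver is a framework for long-horizon video generation that jointly models RGB frames and perceptual conditions such as depth and optical flow."/>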
  <meta property="og:url" content="URL OF THE WEBSITE"/>
  <!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200X630-->
  <meta property="og:image" content="static/image/your_banner_image.png" />
  <meta property="og:image:width" content="1200"/>
  <meta property="og:image:height" content="630"/>

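  <meta name="twitter:title" content="WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception">
  <meta name="twitter:description" content="WorldWeaver is a framework for long-horizon video generation that jointly models RGB frames and perceptual conditions such as depth and optical flow.">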
  <!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200X600-->
  <meta name="twitter:image" content="static/images/your_twitter_banner_image.png">
  <meta name="twitter:card" content="summary_large_image">
  <!-- Keywords for your paper to be indexed by-->
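  <meta name="keywords" content="WorldWeaver, long-horizon video generation, world models, perceptual conditioning, depth, optical flow, memory bank, temporal consistency">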
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception</title>
  <link rel="icon" type="image/x-icon" href="static/images/favicon.ico">
  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">

  <link rel="stylesheet" href="static/css/bulma.min.css">
  <link rel="stylesheet" href="static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="static/css/fontawesome.all.min.css">
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="static/css/index.css">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
  <script defer src="static/js/fontawesome.all.min.js"></script>
  <script src="static/js/bulma-carousel.min.js"></script>
  <script src="static/js/bulma-slider.min.js"></script>
  <script src="static/js/index.js"></script>
</head>
<body>

  <section class="hero">
    <div class="hero-body">
      <div class="container is-max-desktop">
        <div class="columns is-centered">
          <div class="column has-text-centered">
            <h1 class="title is-1 publication-title">WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception</h1>
            <div class="is-size-5 publication-authors">
              <!-- Paper authors -->
              <span class="author-block">Anonymous authors</span>
            </div>
          </div>
        </div>
      </div>
    </div>
  </section>

  <!-- Paper abstract -->
  <section class="section hero is-light">
    <div class="container is-max-desktop">
      <div class="columns is-centered has-text-centered">
        <div class="column is-four-fifths">
          <h2 class="title is-3">Abstract</h2>
          <div class="content has-text-justified">
            <p>
              Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.
            </p>
          </div>
        </div>
      </div>
    </div>
  </section>
  <!-- End paper abstract -->

  <section class="section hero is-small">
    <div class="container is-max-desktop">
      <div class="columns is-centered has-text-centered">
        <div class="column is-four-fifths">
          <h2 class="title is-3">Core Contributions</h2>
          <div class="content has-text-justified">
            <ul>
              <li>Systematically exploring the role of image-based perceptual conditions, such as depth and optical flow, as auxiliary signals for enhancing long-horizon video generation.</li>
              <li>Proposing a unified framework that integrates perceptual conditioning and memory mechanisms for robust long-horizon video prediction.</li>
              <li>Validating the approach extensively across different generative models and datasets, including both general-purpose and robotic manipulation domains, highlighting its potential as a foundation for scalable world models.</li>
            </ul>
          </div>
        </div>
      </div>
    </div>
  </section>

  <!-- Paper Results -->
  <section class="section hero is-small">
    <div class="container is-max-desktop">
      <div class="columns is-centered has-text-centered">
        <div class="column is-four-fifths">
          <h2 class="title is-3">Qualitative Comparison</h2>
          <div class="content has-text-justified">
            <p>
              Below, we present a qualitative comparison of video generation results from three models: ours (WorldWeaver), MAGI, and SkyReels-V2. Each row corresponds to one prompt, with the videos generated by the three models shown side by side.
            </p>

            <!-- Column Headers -->
            <div class="columns is-centered">
              <div class="column is-one-third has-text-centered">
                <h3 class="title is-5">Ours</h3>
              </div>
              <div class="column is-one-third has-text-centered">
                <h3 class="title is-5">MAGI</h3>
              </div>
              <div class="column is-one-third has-text-centered">
                <h3 class="title is-5">SkyReels-V2</h3>
              </div>
            </div>

            <!-- First Row: Woman -->
            <div class="content has-text-centered">
              <p>Prompt: A woman walks down the street and smiles, she puts on sunglasses and keeps walking, she stops and waves at the camera, then turns back and walks away.</p>
            </div>
            <div class="columns is-centered">
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/ours/woman_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/magi/woman_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/sky/woman_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
            </div>
            <div class="content has-text-centered">
            </div>

            <!-- Second Row: Couple -->
            <div class="content has-text-centered">
              <p>Prompt: An elderly couple walks hand in hand in the park. They chat and smile as they stroll. The man feeds the woman a small treat. The camera zooms in on their happy laughter.</p>
            </div>
            <div class="columns is-centered">
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/ours/couple_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/magi/couple_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/sky/couple_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
            </div>
            <div class="content has-text-centered">
            </div>

            <!-- Third Row: Girl -->
            <div class="content has-text-centered">
              <p>Prompt: A young woman types on her laptop in a coffee shop, she takes a sip and checks her schedule, receives a message and smiles, then closes her computer to leave.</p>
            </div>
            <div class="columns is-centered">
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/ours/girl_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/magi/girl_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
              <div class="column is-one-third has-text-centered">
                <video controls style="width: 90%; height: auto;">
                  <source src="./videos/sky/girl_converted.mp4" type="video/mp4">
                  Your browser does not support the video tag.
                </video>
              </div>
            </div>
            <div class="content has-text-centered">
            </div>

          </div>
        </div>
      </div>

      <!-- New Section: More Results -->
      <div class="columns is-centered has-text-centered">
        <div class="column is-four-fifths">
          <h2 class="title is-3">More Results</h2>
          <div class="columns is-centered">
            <div class="column is-half has-text-centered">
              <div class="content has-text-centered">
                <p>Prompt: A little girl sits by the window on a rainy day, she draws shapes on the foggy glass, her mother brings her hot chocolate, together she watches the rain.</p>
              </div>
              <video controls style="width: 90%; height: auto;">
                <source src="./videos/ours/1.mp4" type="video/mp4">
                Your browser does not support the video tag.
              </video>
            </div>
            <div class="column is-half has-text-centered">
              <div class="content has-text-centered">
                <p>Prompt: A young man jogs around a peaceful lake at dawn, he stops to catch his breath and stretch, he takes a photo of the sunrise, then continues running with determination.</p>
              </div>
              <video controls style="width: 90%; height: auto;">
                <source src="./videos/ours/9.mp4" type="video/mp4">
                Your browser does not support the video tag.
              </video>
            </div>
          </div>
          <div class="content has-text-justified is-size-6">
            <p>
              Below, we showcase long-horizon video generation results for <strong>robotic arm tasks</strong>, demonstrating complex manipulation sequences driven solely by text prompts. Our approach generates these videos entirely from text instructions, without relying on action guidance, <strong>unlike prior methods that focus on short-term reconstruction accuracy in simple scenes.</strong>
            </p>
          </div>

          <!-- Single Row: Robotic Arm Videos -->
          <div class="columns is-centered">
            <div class="column is-half has-text-centered">
              <div class="content has-text-centered">
                <p>Prompt: A robot arm picks up a blue cup from the sink area and places it on a tray, then picks up an orange cup places it on a tray, then picks up the green one places it on a tray.</p>
              </div>
              <video controls style="width: 90%; height: auto;">
                <source src="./videos/robotic/1.mp4" type="video/mp4">
                Your browser does not support the video tag.
              </video>
            </div>
            <div class="column is-half has-text-centered">
              <div class="content has-text-centered">
                <p>Prompt: The robotic arm moves downward, approaching drawer and open it, then the robotic arm moves up, then the robotic arm approaches the green can, picks it up and puts it in the drawer, finally it approaches the black bowl, grips and puts it in the drawer.</p>
              </div>
              <video controls style="width: 90%; height: auto;">
                <source src="./videos/robotic/2.mp4" type="video/mp4">
                Your browser does not support the video tag.
              </video>
            </div>
          </div>
        </div>
      </div>
    </div>
  </section>
  <!-- End Paper Results -->

  <footer class="footer">
    <div class="container">
      <div class="columns is-centered">
        <div class="column is-8">
          <div class="content">
            <p>
              This page was built using the <a href="https://github.com/eliahuhorwitz/Academic-project-page-template" target="_blank">Academic Project Page Template</a>.
              <br> This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
            </p>
          </div>
        </div>
      </div>
    </div>
  </footer>

</body>
</html>