<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Supplementary Material</title>
  <style>
    body {
      font-family: sans-serif;
      max-width: 1200px;
      margin: 2rem auto;
      line-height: 1.5;
      background: #fff;
    }

    h1, h2, h3 {
      margin-top: 2rem;
    }

    .figure-grid {
      display: grid;
      grid-template-columns: repeat(2, 1fr);
      gap: 1rem 2rem;
      margin-bottom: 2rem;
      padding: 0 1rem;
    }

    .figure-grid .full-span {
      grid-column: 1 / -1;
      text-align: center;
    }

    .figure-grid figure {
      margin: 0;
      width: 100%;
    }

    .figure-grid img,
    .figure-grid video {
      width: 100%;
      height: auto;
      max-height: 1080px;
      object-fit: contain;
      background: transparent;
      display: block;
      margin: 0 auto;
    }

    figcaption {
      text-align: center;
      font-size: 0.9rem;
      margin-top: 0.5rem;
    }

    .prompts-grid {
      display: grid;
      grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
      gap: 1rem;
      margin-top: 1rem;
      padding: 0 1rem;
    }

    .prompts-grid figure video {
      aspect-ratio: 16 / 9;
    }
  </style>
</head>
<body>

  <h1>Supplementary Material</h1>

  <p>In this HTML file, we present qualitative results for text- and path-conditioned human motion generation, 4D world generation, and velocity estimation from Doppler effects, organized to follow the structure of the Appendix.</p>
  
  <h2 class="section-large">Conditional Human Motion Generation</h2>

  <p>For conditional human motion generation, we begin with customized conditions to highlight the capabilities of our model, followed by qualitative examples from the test set of the HumanML3D dataset.</p>

  <h3 id="diffLength">Varying Path Lengths</h3>
  <p>We first fix the text condition to <em>“walk”</em>, keep the path direction unchanged, and vary the path length over 1, 3, 5, and 7 meters. Paths are colored from <strong>blue</strong> (start) to <strong>red</strong> (end). The results below show that the motions follow both the path and the text.</p>
  <div class="figure-grid">
    <figure>
      <video controls src="assets/motions/1m.mp4"></video>
      <figcaption>Path length: 1 m</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/3m.mp4"></video>
      <figcaption>Path length: 3 m</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/5m.mp4"></video>
      <figcaption>Path length: 5 m</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/7m.mp4"></video>
      <figcaption>Path length: 7 m</figcaption>
    </figure>
  </div>

  <h3 id="diffDirection">Varying Path Directions</h3>
  <p>We then change the text to <em>“slowly walk”</em>, fix the path length, and vary the path direction over ±90°, ±45°, and ±30°. The results below show that the generated motion is noticeably slower while still following the path.</p>
  <div class="figure-grid">
    <figure>
      <video controls src="assets/motions/-90.mp4"></video>
      <figcaption>Path direction: –90°</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/90.mp4"></video>
      <figcaption>Path direction: 90°</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/-45.mp4"></video>
      <figcaption>Path direction: –45°</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/45.mp4"></video>
      <figcaption>Path direction: 45°</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/-30.mp4"></video>
      <figcaption>Path direction: –30°</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/30.mp4"></video>
      <figcaption>Path direction: 30°</figcaption>
    </figure>
  </div>

  <h3 id="diffAction">Varying Text Descriptions</h3>
  <p>Next, we keep the path direction and length fixed and change the text to <em>jump</em>, <em>run</em>, <em>walk as if there are stairs in the front</em>, and <em>wave their arms</em>. The generated human motions closely follow the provided text conditions.</p>
  <div class="figure-grid">
    <figure>
      <video controls src="assets/motions/jump.mp4"></video>
      <figcaption>Text: Jump</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/run.mp4"></video>
      <figcaption>Text: Run</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/stair.mp4"></video>
      <figcaption>Text: Walk as if there are stairs in the front</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/wave.mp4"></video>
      <figcaption>Text: Wave their arms</figcaption>
    </figure>
  </div>

  <h3 id="allDiff">Random Combinations</h3>
  <p>Lastly, we evaluate how well the model generalizes to random text/path combinations. The model follows each combination of conditions and generates the corresponding motion.</p>
  <div class="figure-grid">
    <figure>
      <video controls src="assets/motions/run-rand.mp4"></video>
      <figcaption>Run; 8 m; 30°</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/jump-rand.mp4"></video>
      <figcaption>Jump; 2.5 m; –90°</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/stair-rand.mp4"></video>
      <figcaption>Walk as if there are stairs in the front; 3.5 m; 45°</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/wave-rand.mp4"></video>
      <figcaption>Wave arms; 2.0 m; –30°</figcaption>
    </figure>
  </div>

  <h3 id="testset">Performance on HumanML3D Test Set</h3>
  <p>We provide qualitative results on the HumanML3D test set.
    These results highlight the alignment between the generated motions and more complex text and path conditions, demonstrating the model’s ability to produce coherent and contextually accurate human motion.
    Generating these motions requires the model to jointly interpret the semantics of the text and the spatial constraints of the path, understanding not only what action is being described, but also where and how it should move.
    Text conditions are included in the subcaptions, and path conditions are visualized as points in the scene.</p>
  <div class="figure-grid">
    <figure>
      <video controls src="assets/motions/1.mp4"></video>
      <figcaption>Text: The person takes a step and waves his right hand back and forth.</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/2.mp4"></video>
      <figcaption>Text: A man walks backwards and then stops.</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/3.mp4"></video>
      <figcaption>Text: A person walks in a circular motion.</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/4.mp4"></video>
      <figcaption>Text: A person bends to the right.</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/5.mp4"></video>
      <figcaption>Text: A person begins walking forward first with their left foot, taking wide awkward steps as if they are stepping around or over something; begins walking towards the right and then slowly continues to walk to the left, then continues to walk towards the right coming to a stop off to the right side.</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/6.mp4"></video>
      <figcaption>Text: The person was pushed but did not fall.</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/7.mp4"></video>
      <figcaption>Text: A figure tip toes around while walking in a slolam like motion.</figcaption>
    </figure>
    <figure>
      <video controls src="assets/motions/8.mp4"></video>
      <figcaption>Text: A person who is walking moves forward taking six confident strides.</figcaption>
    </figure>
  </div>

  <h2 class="section-large">Generated 4D World</h2>

  <p>
    We present a series of dynamic 4D scenes generated by WaveVerse, focusing on the alignment between the generated human motion and the environment.
    While WaveVerse can readily generate shorter motions in open or less constrained spaces, we emphasize its ability to handle more challenging scenarios, producing long, coherent motions within visually complex and spatially constrained environments.
    The generated scenes below show qualitative results for such cases, including narrow hallways, intricate layouts, and long human motions. The generated motions align with the surrounding layout, navigating obstacles and fitting within the scene’s geometry.
    Interestingly, the motions sometimes appear to interact with the scene, as in the 5th and 7th examples, even though no explicit interaction is modeled.

  </p>

  <div class="figure-grid">
    <figure>
      <img src="assets/generated_world/gallery.png" alt="Bird’s-eye view 1">
      <figcaption>A broad gallery; Slowly tour around</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/gallery.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
    <figure>
      <img src="assets/generated_world/y-shaped.png" alt="Bird’s-eye view 2">
      <figcaption>A hallway; Wave the arm</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/y-shaped.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
    <figure>
      <img src="assets/generated_world/zigzag.png" alt="Bird’s-eye view 3">
      <figcaption>A zigzag hallway; Navigate</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/zigzag.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
    <figure>
      <img src="assets/generated_world/keyhole.png" alt="Bird’s-eye view 4">
      <figcaption>A keyhole-shaped hallway; Bend to pick something up</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/keyhole.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
    <figure>
      <img src="assets/generated_world/cabin.png" alt="Bird’s-eye view 5">
      <figcaption>A cozy cabin kitchen; Walk to retrieve items</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/cabin.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
    <figure>
      <img src="assets/generated_world/winding.png" alt="Bird’s-eye view 6">
      <figcaption>A winding corridor; Walk</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/winding.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
    <figure>
      <img src="assets/generated_world/L-shaped.png" alt="Bird’s-eye view 7">
      <figcaption>An L-shaped hallway; Quickly move</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/L-shaped.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
  </div>
  


  <p>
    Since WaveVerse does not explicitly model physical interaction between the human and the scene, occasional minor collisions may occur.
    For instance, in the following two examples, a hand or arm occasionally passes through nearby objects.
    However, these moments are brief, and they do not significantly affect the motion quality for downstream RF sensing tasks, where precise physical contact modeling is not required.
  </p>

  <div class="figure-grid">
    <figure>
      <img src="assets/generated_world/bathroom.png" alt="Bird’s-eye view 1">
      <figcaption>A chic bathroom; Walk and almost slip</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/bathroom.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
    <figure>
      <img src="assets/generated_world/u-shaped.png" alt="Bird’s-eye view 2">
      <figcaption>A U-shaped hallway; Jump</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/u-shaped.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
  </div>

   <p>
    Lastly, we present an interesting and challenging case where the character performs a dance sequence in a cluttered environment filled with objects.
    Despite the complexity of the motion, the character still avoids collisions with the surrounding objects, indicating that the generated motion respects the scene layout.
    We also observe slight jitter in this example, caused by SMPL fitting artifacts under standard procedures when handling extremely abrupt, high-speed motions.
    Such artifacts are very rare and occur only in limited cases.
    Nonetheless, this example highlights the capability of WaveVerse in handling complex, high-dynamic motions within constrained scenes.
   </p>

  <div class="figure-grid">
    <figure>
      <img src="assets/generated_world/music.png" alt="Bird’s-eye view 1">
      <figcaption>A classic music room; Dance</figcaption>
    </figure>
    <figure>
      <video controls src="assets/generated_world/music.mp4"></video>
      <figcaption>Close-up View of Motion</figcaption>
    </figure>
  </div>

  <h2 class="section-large">Velocity Estimation from Doppler Effects</h2>
  <p>
    In this experiment, we simulate a rigid sphere moving back and forth along a straight line with sinusoidal velocity.
    A radar is positioned in front of the sphere, and velocity is estimated from Doppler shifts.
    This task requires precise tracking of phase changes induced by motion across different timestamps.
    The results are visualized as range–velocity maps at each timestamp, where we expect to observe a sinusoidal velocity pattern over time reflecting the sphere’s periodic motion.
    In addition, a narrow velocity band should appear across several range bins, since the spatial extent of the sphere causes multiple ranges to share the same velocity.
    The video below shows that our method, which preserves temporal phase coherence, produces substantially cleaner range–velocity maps than conventional ray tracing.
   </p>


   <div class="figure-grid">
    <figure class="full-span">
      <video controls src="assets/Doppler_comparison.mp4"></video>
    </figure>
  </div>

</body>

</html>
