Supplementary Material

In this HTML file, we present qualitative results for text- and path-conditioned human motion generation, 4D world generation, and velocity estimation from Doppler effects, organized consistently with the structure of the Appendix.

Conditional Human Motion Generation

For conditional human motion generation, we begin with customized conditions that highlight the capabilities of our model, followed by qualitative examples from the test set of the HumanML3D dataset.

Varying Path Lengths

We first fix the text condition to “walk” and keep the path direction constant while varying the path length over 1, 3, 5, and 7 meters. Paths are colored from blue (start) to red (end). The results below show that the generated motions follow both the path and the text.

Path length: 1 m
Path length: 3 m
Path length: 5 m
Path length: 7 m

Varying Path Directions

We then change the text to “slowly walk” and fix the path length while varying the path direction over ±90°, ±45°, and ±30°. The results below show that the generated human moves noticeably slower while still following the path.

Path direction: –90°
Path direction: 90°
Path direction: –45°
Path direction: 45°
Path direction: –30°
Path direction: 30°
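
The straight-line path conditions used in the two experiments above can be sketched as a list of ground-plane waypoints parameterized by length and direction. This is only an illustrative sketch, not the code used to produce the results: the function names `straight_path` and `path_colors`, the number of waypoints, the angle convention (0° along +z, positive angles rotating towards +x), and the linear RGB interpolation for the blue-to-red coloring are all assumptions made for this example.

```python
import math


def straight_path(length_m, direction_deg, num_points=50):
    """Return (x, z) ground-plane waypoints of a straight path.

    Assumed convention (not taken from the paper): the path starts at the
    origin, 0 degrees points along +z, and positive angles rotate towards +x.
    """
    theta = math.radians(direction_deg)
    points = []
    for i in range(num_points):
        s = length_m * i / (num_points - 1)  # distance travelled so far
        points.append((s * math.sin(theta), s * math.cos(theta)))
    return points


def path_colors(num_points):
    """RGB colors fading linearly from blue (start) to red (end)."""
    colors = []
    for i in range(num_points):
        t = i / (num_points - 1)
        colors.append((t, 0.0, 1.0 - t))
    return colors


def path_length(points):
    """Total length of the polyline through the waypoints."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


if __name__ == "__main__":
    # The length and direction settings listed in the two experiments above.
    for length in (1.0, 3.0, 5.0, 7.0):
        print(length, path_length(straight_path(length, 0.0)))
    for angle in (-90.0, 90.0, -45.0, 45.0, -30.0, 30.0):
        print(angle, straight_path(3.0, angle)[-1])
```

Under these assumptions, varying `length_m` with a fixed `direction_deg` corresponds to the path-length experiment, and varying `direction_deg` with a fixed `length_m` corresponds to the path-direction experiment.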

Varying Text Descriptions

Next, we keep the path direction and length fixed and change the text to: jump, run, walk as if there are stairs in the front, and wave their arms. The generated human motions closely follow the provided text conditions.

Text: Jump
Text: Run
Text: Walk as if there are stairs in the front
Text: Wave their arms

Random Combinations

Lastly, we evaluate how well the model generalizes to random text/path combinations. The model follows each of the provided conditions and generates the corresponding motions.

Run; 8 m; 30°
Jump; 2.5 m; –90°
Walk as if there are stairs in the front; 3.5 m; 45°
Wave arms; 2.0 m; –30°

Performance on HumanML3D Test Set

We provide qualitative results on the HumanML3D test set. These results highlight the alignment between the generated motions and more complex text and path conditions, demonstrating the model’s ability to produce coherent and contextually accurate human motion. Generating these motions requires the model to jointly interpret the semantics of the text and the spatial constraints of the path, understanding not only what action is being described, but also where and how it should move. Text conditions are included in the subcaptions, and path conditions are visualized as points in the scene.

Text: The person takes a step and waves his right hand back and forth.
Text: A man walks backwards and then stops.
Text: A person walks in a circular motion.
Text: A person bends to the right.
Text: A person begins walking forward first with their left foot, taking wide awkward steps as if they are stepping around or over something; begins walking towards the right and then slowly continues to walk to the left, then continues to walk towards the right coming to a stop off to the right side.
Text: The person was pushed but did not fall.
Text: A figure tip toes around while walking in a slolam like motion.
Text: A person who is walking moves forward taking six confident strides.

Generated 4D World

We present a series of dynamic 4D scenes generated by WaveVerse, focusing on the alignment between the generated human motion and the environment. While WaveVerse can effortlessly generate shorter motions in open or less constrained spaces, we emphasize its ability to handle more challenging scenarios, producing long, coherent motions within visually complex and spatially constrained environments. The generated scenes below showcase qualitative results in such cases, including narrow hallways, intricate layouts, and long human motions. The generated motions align with the surrounding layout, navigating obstacles and fitting within the scene’s geometry. Interestingly, the motions sometimes appear to interact with the scene, such as in the 5th and 7th examples, even though no explicit interaction is modeled.

Bird’s-eye view 1
A broad gallery; Slowly tour around
Close-up View of Motion
Bird’s-eye view 2
A hallway; Wave the arm
Close-up View of Motion
Bird’s-eye view 3
A zigzag hallway; Navigate
Close-up View of Motion
Bird’s-eye view 4
A keyhole-shaped hallway; Bend to pick something up
Close-up View of Motion
Bird’s-eye view 5
A cozy cabin kitchen; Walk to retrieve items
Close-up View of Motion
Bird’s-eye view 6
A winding corridor; Walk
Close-up View of Motion
Bird’s-eye view 7
An L-shaped hallway; Quickly move
Close-up View of Motion

Since WaveVerse does not explicitly model physical interaction between the human and the scene, occasional minor collisions may occur. For instance, in the following two examples, a hand or arm occasionally passes through nearby objects. However, these moments are brief, and they do not significantly affect the motion quality for downstream RF sensing tasks, where precise physical contact modeling is not required.

Bird’s-eye view 1
A chic bathroom; Walk and almost slip
Close-up View of Motion
Bird’s-eye view 2
A U-shaped hallway; Jump
Close-up View of Motion

Lastly, we present an interesting and challenging case where the character performs a dance sequence in a cluttered environment filled with objects. Despite the complexity of the motion, the character still avoids collisions with the surrounding objects, indicating spatial awareness. We also observe slight jitter in this example, caused by SMPL fitting artifacts under standard procedures when handling extremely abrupt, high-speed motions. We confirm that such artifacts are very rare and occur only in limited cases. Nonetheless, this example highlights the capability of WaveVerse in handling complex, highly dynamic motions within constrained scenes.

Bird’s-eye view 1
A classic music room; Dance
Close-up View of Motion

Velocity Estimation from Doppler Effects

In this experiment, we simulate a rigid sphere moving back and forth along a straight line with sinusoidal velocity. A radar is positioned in front of the sphere, and velocity is estimated from Doppler shifts. This task requires precise tracking of the phase changes induced by motion across different timestamps. The results are visualized as range–velocity maps at each timestamp, where we expect to observe a sinusoidal velocity pattern over time reflecting the sphere’s periodic motion. In addition, a narrow velocity band should appear across several range bins, since the spatial extent of the sphere causes multiple ranges to share the same velocity. The video below shows that our method, which preserves temporal phase coherence, produces substantially cleaner range–velocity maps than conventional ray tracing.
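
To illustrate why phase coherence across timestamps matters for this task, the following minimal sketch estimates the radial velocity of a single point target from the phase change between consecutive radar snapshots, using the relation v = -λ·Δφ / (4π·T) for a round-trip phase of φ = -4π·r / λ. It is not the WaveVerse simulator and does not produce range–velocity maps: it swaps in a simple pulse-pair phase-difference estimator in place of range–Doppler processing, models the sphere as one point, and uses assumed values for the wavelength, snapshot interval, and motion parameters. The `coherent=False` branch adds an independent random phase to every snapshot as a crude, hypothetical stand-in for a simulator that does not preserve temporal phase coherence.

```python
import cmath
import math
import random

# All numeric values below are assumptions chosen for illustration.
WAVELENGTH = 0.0039  # m, roughly a 77 GHz radar
T_SNAPSHOT = 1e-4    # s between consecutive radar snapshots
V_AMP = 2.0          # m/s, peak speed of the target
F_MOTION = 2.0       # Hz, oscillation frequency of the motion
R0 = 3.0             # m, range at t = 0


def true_velocity(t):
    """Sinusoidal radial velocity (positive = moving away from the radar)."""
    return V_AMP * math.sin(2 * math.pi * F_MOTION * t)


def target_range(t):
    """Range obtained by integrating true_velocity from t = 0."""
    omega = 2 * math.pi * F_MOTION
    return R0 + V_AMP / omega * (1 - math.cos(omega * t))


def simulate_returns(num_samples, coherent=True, seed=0):
    """Unit-amplitude complex returns with round-trip phase -4*pi*r/lambda."""
    rng = random.Random(seed)
    samples = []
    for n in range(num_samples):
        phase = -4 * math.pi * target_range(n * T_SNAPSHOT) / WAVELENGTH
        if not coherent:
            # Hypothetical incoherent case: random phase at every timestamp.
            phase += rng.uniform(-math.pi, math.pi)
        samples.append(cmath.exp(1j * phase))
    return samples


def estimate_velocity(samples):
    """Velocity from the phase difference of consecutive snapshots."""
    velocities = []
    for prev, cur in zip(samples, samples[1:]):
        dphi = cmath.phase(cur * prev.conjugate())
        velocities.append(-WAVELENGTH * dphi / (4 * math.pi * T_SNAPSHOT))
    return velocities


def max_error(velocities):
    """Largest deviation from the true velocity at each interval midpoint."""
    return max(
        abs(v - true_velocity((n + 0.5) * T_SNAPSHOT))
        for n, v in enumerate(velocities)
    )


if __name__ == "__main__":
    coherent = estimate_velocity(simulate_returns(5000, coherent=True))
    incoherent = estimate_velocity(simulate_returns(5000, coherent=False))
    print("max error, coherent phase:  ", max_error(coherent))
    print("max error, incoherent phase:", max_error(incoherent))
```

With these assumed parameters, the unambiguous velocity range is |v| < λ / (4T) = 9.75 m/s, so the 2 m/s peak speed does not alias. When the phase is coherent, the estimates trace the sinusoidal velocity; when each snapshot carries an unrelated phase, the estimates scatter across the whole unambiguous range.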