SCENIC: Scene-aware Semantic Navigation with Instruction-guided Control

Published: 05 Nov 2025, Last Modified: 30 Jan 2026
Venue: 3DV 2026 Poster
License: CC BY 4.0
Keywords: Human Motion Synthesis; Human-scene Interaction; Diffusion Model
Abstract: Synthesizing natural human motion that adapts to complex environments while allowing creative control remains a fundamental challenge in motion synthesis. Existing models often fall short, either by assuming flat terrain or by lacking the ability to control motion semantics through text. To address these limitations, we introduce SCENIC, a diffusion model designed to generate human motion that adapts to dynamic terrains within virtual scenes while enabling semantic control through natural language. The key technical challenge lies in reasoning about complex scene geometry while maintaining text control, which requires understanding both high-level navigation goals and fine-grained environmental constraints. The model must ensure physical plausibility and precise navigation across varied terrain, while also preserving user-specified text control, such as "carefully stepping over obstacles" or "walking upstairs like a zombie." At the core of our method is a novel hierarchical scene reasoning framework that combines two key components: a motion-scene cross-attention block that aligns the human body’s motion features with local scene geometry, enabling precise low-level interactions; and a target point canonicalization module that provides global goal conditioning by normalizing target scene coordinates for high-level guidance. To ensure plausibility and naturalness, we leverage a pre-trained motion diffusion prior and apply scene-constrained noise optimization during sampling, enabling long-horizon motion generation that respects both scene structure and semantic text input. Experiments demonstrate that our diffusion model generates arbitrarily long human motions that both adapt to complex scenes with varying terrain surfaces and respond to textual prompts. Additionally, we show that SCENIC generalizes to four real-scene datasets.
Supplementary Material: zip
Submission Number: 77