HGWM: Hierarchical Graph-guided World Model for Zero-shot Object Navigation via Scene-Goal Graph Matching
Keywords: Embodied AI
Abstract: Object Goal Navigation, which requires an agent to locate specific objects in unknown indoor environments, remains a fundamental challenge in embodied AI that demands sophisticated spatial-semantic understanding. Although recent Vision-Language Model (VLM) based approaches have shown promise in perception and reasoning, current methods lack a systematic world model architecture that can predict environmental states and reduce inefficient exploration. We introduce HGWM (Hierarchical Graph-guided World Model), a novel navigation framework that integrates dual-graph matching with a unified Spatial-Semantic World Model to enable robust object localization. HGWM constructs two complementary graph representations: a goal subgraph encoding LLM-derived spatial knowledge about target objects and room hierarchies, and a dynamically maintained scene graph derived from our persistent Spatial-Semantic World Model. These graphs interact through a dual-matching mechanism that combines implicit VLM-guided semantic alignment with explicit structural correspondence. Our multi-stage exploration strategy adapts to the degree of graph matching, transitioning from systematic exploration to focused search and finally to target verification. Experiments on the HM3D v0.1, HM3D v0.2, and MP3D benchmarks demonstrate HGWM's effectiveness: it achieves state-of-the-art performance with a 59.6% success rate (SR) and 31.5% SPL on HM3D v0.1, and a 45.8% SR and 17.3% SPL on MP3D, outperforming previous methods by up to 1.5% SR and 0.3% SPL. Our code will be made public soon.
Primary Area: applications to robotics, autonomy, planning
Submission Number: 18743