Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Published: 05 Apr 2024, Last Modified: 16 Apr 2024 | VLMNM 2024 | CC BY 4.0
Keywords: hierarchical 3D scene graphs, large language models, robot navigation, open-vocabulary, real world, zero-shot
TL;DR: We propose hierarchical, open-vocabulary 3D scene graphs for long-horizon, language-grounded robot navigation from long queries, contribute to open-set 3D mapping, and test our system in real-world multi-floor environments.
Abstract: Typically, robotic mapping relies on highly accurate dense representations obtained via approaches to simultaneous localization and mapping. While these maps allow for point/voxel-level features, they do not provide language grounding within large-scale environments due to the sheer number of points. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for robot navigation. Using open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary maps in 3D. We then perform floor as well as room segmentation and identify room names. Finally, we construct a 3D scene graph hierarchy. Our approach is able to represent multi-story buildings and allows robots to traverse them by providing feasible links among floors. We demonstrate long-horizon robotic navigation in large-scale indoor environments from long queries using large language models based on the obtained scene graph tokens and outperform previous baselines.
Supplementary Material: zip
Submission Number: 46
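
The floor, room, and object hierarchy described in the abstract can be illustrated with a minimal sketch. This is not the HOV-SG implementation: the `Node` class and `resolve` function are hypothetical names, exact string matching stands in for open-vocabulary feature comparison, and the long query is assumed to have already been split into floor, room, and object terms.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One scene graph node; `label` is a stand-in for an open-vocabulary feature."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        # Attach a child node and return it so the hierarchy can be built top-down.
        self.children.append(child)
        return child


def resolve(root: Node, floor: str, room: str, obj: str) -> Optional[Tuple[Node, Node, Node]]:
    """Descend building -> floor -> room -> object, matching one level at a time."""
    for f in root.children:
        if f.label != floor:
            continue
        for r in f.children:
            if r.label != room:
                continue
            for o in r.children:
                if o.label == obj:
                    return f, r, o
    return None


# Toy two-floor building (assumed data, for illustration only).
building = Node("building")
floor0 = building.add(Node("floor 0"))
floor1 = building.add(Node("floor 1"))
kitchen = floor0.add(Node("kitchen"))
office = floor1.add(Node("office"))
kitchen.add(Node("sink"))
office.add(Node("desk"))
office.add(Node("plant"))

match = resolve(building, "floor 1", "office", "plant")
path = [n.label for n in match] if match else []
print(path)
```

Resolving one level at a time keeps each lookup restricted to the children of the node matched above it, which is the practical benefit of a hierarchy over searching every object in the building at once.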