3DSMaps: Zero-shot Learning of 3D Semantic Maps for Open-vocabulary Mobile Manipulation in Dynamic Environments

Published: 26 Jun 2024 · Last Modified: 05 Mar 2025 · OpenReview Archive Direct Upload · CC BY-NC-ND 4.0
Abstract: Robots in real-world applications frequently face dynamic, previously unencountered environments, where objects may be repositioned irregularly or may never have been seen before. This requires a robot not only to possess open-vocabulary mobile manipulation (OVMM) capabilities, but also to understand the semantics of its environment even in the presence of dynamic changes, ideally in a zero-shot manner. To tackle these challenges, we propose a novel module-based approach that integrates the zero-shot detection and grounded recognition capabilities of pretrained visual-language models (VLMs) with dense 3D entity reconstruction. This combination enables the robot to learn the 3D semantic structure of its environment, represented as 3D Semantic Maps (3DSMaps). Additionally, we employ large language models (LLMs) for spatial region abstraction and online planning, incorporating human instructions and spatial semantic context. We built a 10-DoF mobile manipulation robotic platform and demonstrate in real-world experiments that our framework effectively captures spatial semantics and interprets human instructions for zero-shot OVMM tasks in dynamic environments, with robust replanning capabilities for handling exceptions as they arise.
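
To make the two modules described above concrete, here is a minimal Python sketch of one way VLM detections could be fused with depth into a 3D semantic map. This is an illustration under simplifying assumptions, not the authors' implementation: the open-vocabulary detector interface `detect`, the detection fields `label`/`box`, and the centroid-based map are hypothetical placeholders, and the paper's dense 3D entity reconstruction is more involved than back-projecting a single box-center pixel.

```python
# Sketch: accumulate open-vocabulary detections into a 3D semantic map.
# `detect(rgb, vocabulary)` is a hypothetical VLM detector returning objects
# with `.label` and `.box = (x0, y0, x1, y1)` attributes.
from __future__ import annotations
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SemanticMap3D:
    # label -> list of 3D points (world frame) observed for that entity
    entities: dict[str, list[np.ndarray]] = field(default_factory=dict)

    def add(self, label: str, point_w: np.ndarray) -> None:
        self.entities.setdefault(label, []).append(point_w)

    def locate(self, label: str) -> np.ndarray | None:
        # Crude localization: centroid of all observations of this label.
        pts = self.entities.get(label)
        return np.mean(pts, axis=0) if pts else None

def backproject(u: int, v: int, depth: float, K: np.ndarray) -> np.ndarray:
    """Pinhole back-projection of pixel (u, v) at `depth` into the camera frame."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def update_map(smap, rgb, depth_img, K, T_wc, detect, vocabulary):
    """Project each open-vocabulary detection into world coordinates and record it."""
    for det in detect(rgb, vocabulary):
        u = int((det.box[0] + det.box[2]) / 2)   # box-center pixel
        v = int((det.box[1] + det.box[3]) / 2)
        d = float(depth_img[v, u])
        if d <= 0:                               # skip invalid depth readings
            continue
        p_c = np.append(backproject(u, v, d, K), 1.0)   # homogeneous camera point
        smap.add(det.label, (T_wc @ p_c)[:3])           # camera -> world via 4x4 pose
```

The online planning and replanning loop can be sketched in the same spirit. Again the interfaces are assumptions for illustration: `query_llm` stands in for prompting the LLM with the instruction and semantic-map context, and `execute` for running one primitive skill and reporting success plus feedback.

```python
# Sketch: LLM-driven online planning with exception-triggered replanning.
def plan_and_act(instruction, smap, query_llm, execute, max_replans=3):
    # Summarize the semantic map (label -> estimated 3D location) as LLM context.
    context = {label: m.tolist() for label in smap.entities
               if (m := smap.locate(label)) is not None}
    plan = query_llm(instruction, context)       # e.g. ["go_to cup", "pick cup", ...]
    for _attempt in range(max_replans + 1):
        for i, step in enumerate(plan):
            ok, feedback = execute(step)         # run one primitive skill
            if not ok:                           # exception: replan from the failure
                plan = query_llm(instruction, context, failed_step=step,
                                 feedback=feedback, done=plan[:i])
                break
        else:
            return True                          # all steps succeeded
    return False                                 # replanning budget exhausted
```

The key design point this sketch reflects is that the map, not the raw images, is what grounds the LLM: planning and replanning consume a compact symbolic-plus-geometric summary, so dynamic changes only require updating the map and re-querying the planner.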