Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation

Published: 05 Apr 2024, Last Modified: 16 Apr 2024 · VLMNM 2024 Spotlight · CC BY 4.0
Keywords: scene graphs, large language models, grounding, open-vocabulary, mobile manipulation, interactive object search, real world, zero-shot
TL;DR: We ground language models in dynamic scene graphs for zero-shot, open-vocabulary interactive object search and mobile manipulation, and demonstrate the approach in large-scale environments both in simulation and in the real world.
Abstract: To fully leverage the capabilities of mobile manipulation robots, it is imperative that they are able to autonomously execute long-horizon tasks in large unexplored environments. While large language models (LLMs) have shown emergent reasoning skills on arbitrary tasks, existing work primarily concentrates on explored environments, typically focusing on either navigation or manipulation tasks in isolation. In this work, we propose MoMa-LLM, a novel approach that grounds language models within structured representations derived from open-vocabulary scene graphs, dynamically updated as the environment is explored. We tightly interleave these representations with an object-centric action space. Importantly, we demonstrate the effectiveness of MoMa-LLM in a novel semantic interactive search task in large realistic indoor environments. The resulting approach is zero-shot, open-vocabulary, and readily extendable to a spectrum of mobile manipulation and household robotic tasks. Through extensive experiments in both simulation and the real world, we demonstrate substantially improved search efficiency compared to conventional baselines and state-of-the-art approaches. We make the code publicly available at http://moma-llm.cs.uni-freiburg.de.
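To make the core idea in the abstract concrete, the sketch below illustrates one possible way a dynamically updated scene graph could be serialized into a structured prompt and paired with an object-centric action space. This is not the authors' implementation; all class names, fields, and actions (e.g. `SceneGraph`, `open(<object>)`) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch (not the authors' code): serializing a dynamic scene graph
# into an LLM-readable prompt, paired with an object-centric action space.
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str               # open-vocabulary label, e.g. "mug"
    room: str               # room in which the object was observed
    state: str = "unknown"  # e.g. "closed" / "open" for articulated objects


@dataclass
class SceneGraph:
    """Minimal dynamic scene graph: rooms and the objects observed in them."""
    rooms: list = field(default_factory=list)
    objects: list = field(default_factory=list)

    def update(self, obj: SceneObject) -> None:
        # Grow the graph incrementally as the environment is explored.
        if obj.room not in self.rooms:
            self.rooms.append(obj.room)
        self.objects.append(obj)

    def to_prompt(self) -> str:
        # Serialize the graph into a structured textual summary for the LLM.
        lines = ["Explored rooms and objects:"]
        for room in self.rooms:
            objs = [o for o in self.objects if o.room == room]
            desc = ", ".join(f"{o.name} ({o.state})" for o in objs) or "nothing yet"
            lines.append(f"- {room}: {desc}")
        return "\n".join(lines)


# Object-centric action space: each high-level action takes an object or room
# from the scene graph as its argument (hypothetical action names).
ACTIONS = ["navigate_to(<room>)", "open(<object>)", "close(<object>)", "done()"]


def build_query(graph: SceneGraph, target: str) -> str:
    """Compose the grounded prompt handed to the language model."""
    return (
        f"{graph.to_prompt()}\n"
        f"Task: find the '{target}'.\n"
        f"Available actions: {', '.join(ACTIONS)}\n"
        f"Reply with exactly one action."
    )


if __name__ == "__main__":
    g = SceneGraph()
    g.update(SceneObject("cabinet", room="kitchen", state="closed"))
    g.update(SceneObject("sofa", room="living room"))
    print(build_query(g, target="mug"))
    # The LLM's reply (e.g. "open(cabinet)") would be parsed, executed by the
    # robot's mobile-manipulation skills, and the graph updated with the result.
```

In such a setup, the interleaving of representation and action space means the model can only refer to objects and rooms that the robot has actually discovered, which is one plausible reading of the grounding described above.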
Supplementary Material: zip
Submission Number: 24