Vision and Language Navigation in the Real World via Online Visual Language Mapping

Published: 21 Oct 2023, Last Modified: 27 Oct 2023. LangRob @ CoRL 2023 Poster.
Keywords: Vision-and-language Navigation, Online Visual Language Mapping, Foundation Models
Abstract: Enhancing mobile robots with the ability to follow language instructions will improve navigation efficiency in previously unseen environments. However, state-of-the-art (SOTA) vision-and-language navigation (VLN) methods are mainly evaluated in simulation, neglecting the complexity of the real world. Directly transferring SOTA navigation policies learned in simulation to the real world is challenging due to the visual domain gap and the absence of prior knowledge about unseen environments. In this work, we propose a novel navigation framework to address the VLN task in the real world. Building on powerful foundation models, the proposed framework comprises four key components: (1) a large language model (LLM)-based instruction parser that converts a language instruction into a sequence of pre-defined macro-action descriptions, (2) an online visual-language mapper that builds a spatial and semantic map of the unseen environment using large visual-language models (VLMs), (3) a language indexing-based localizer that grounds each macro-action description to a waypoint location on the map, and (4) a DD-PPO-based local controller that predicts low-level actions. Evaluated on an Interbotix LoCoBot WX250 in an unseen lab environment, without any fine-tuning, our pipeline significantly outperforms the SOTA VLN baseline in the real world.
Submission Number: 10
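
To make the four-component pipeline described in the abstract concrete, the sketch below shows one minimal way the stages could be wired together. It is not the authors' implementation: the names (`VisualLanguageMap`, `parse_instruction`, `local_controller_step`, `navigate`) and the naive word-overlap localizer are illustrative assumptions standing in for the LLM-based parser, the VLM-based online map, the language indexing-based localizer, and the DD-PPO controller.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical types: a macro-action description is plain text,
# and a waypoint is an (x, y) location in the map frame (an assumption).
MacroAction = str
Waypoint = Tuple[float, float]


@dataclass
class VisualLanguageMap:
    """Toy stand-in for the online visual-language map: stores, for each
    observed landmark description, a waypoint location on the spatial map."""
    landmarks: Dict[str, Waypoint] = field(default_factory=dict)

    def update(self, description: str, location: Waypoint) -> None:
        # In the paper this step would use large visual-language models (VLMs);
        # here we simply record a (description, location) pair.
        self.landmarks[description] = location

    def localize(self, query: MacroAction) -> Waypoint:
        # Language indexing-based localizer: ground a macro-action description
        # to the stored landmark whose description overlaps it the most.
        # (Naive word-overlap scoring; a real system would use embeddings.)
        def overlap(desc: str) -> int:
            return len(set(desc.lower().split()) & set(query.lower().split()))
        best = max(self.landmarks, key=overlap)
        return self.landmarks[best]


def parse_instruction(instruction: str) -> List[MacroAction]:
    """Placeholder for the LLM-based instruction parser: splits a language
    instruction into a sequence of macro-action descriptions."""
    return [part.strip() for part in instruction.split(",") if part.strip()]


def local_controller_step(current: Waypoint, goal: Waypoint) -> Waypoint:
    """Placeholder for the DD-PPO-based local controller: drives the robot
    toward the goal waypoint (here, it simply jumps to it)."""
    return goal


def navigate(instruction: str, vl_map: VisualLanguageMap, start: Waypoint) -> Waypoint:
    """Run the four-stage pipeline on one instruction."""
    pose = start
    for macro_action in parse_instruction(instruction):   # (1) parse instruction
        goal = vl_map.localize(macro_action)               # (3) ground on (2) the map
        pose = local_controller_step(pose, goal)           # (4) act toward the waypoint
    return pose


if __name__ == "__main__":
    vl_map = VisualLanguageMap()
    vl_map.update("the door near the lab entrance", (3.0, 0.5))
    vl_map.update("the desk with the monitor", (6.0, 2.0))
    final_pose = navigate("go to the door, then stop at the desk", vl_map, start=(0.0, 0.0))
    print(final_pose)  # expected: (6.0, 2.0)
```

The key design point the sketch tries to convey is that mapping, localization, and control are decoupled: the map is built online from observations, and each parsed macro-action is grounded against whatever the map currently contains before the controller executes it.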