Multiple Visual Features in Topological Map for Vision-and-Language Navigation

Ruonan Liu, Ping Kong, Weidong Zhang

Published: 2024, Last Modified: 15 May 2025, IROS 2024, CC BY-SA 4.0

Abstract: Vision-and-Language Navigation (VLN) in continuous environments requires a robot agent to navigate unseen environments by following natural language instructions. Most existing approaches construct semantic maps or topological maps to record information. However, semantic maps overlook detailed object information and the correspondence among views during navigation, while topological maps lack spatial representation between entities. To address these limitations, we propose a novel visual feature representation method for continuous VLN, called Multiple Visual Features in Topological Map (MV-Topo). MV-Topo uses three distinct visual encoders to extract visual features, which are integrated into a dynamically generated topological map. These fused features take part in the subsequent cross-modal planning to derive a long-term path towards a subgoal, guiding the agent to the final location. We experimentally demonstrate the effectiveness of our approach and achieve competitive results on the full VLN-CE test splits. Notably, our method outperforms the state of the art by 3.5% on the Navigation Error (NE) metric, indicating that using multiple visual features significantly enhances the agent’s perception of semantic targets.
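
The idea of storing fused features from several visual encoders in a map that grows during navigation can be illustrated with a minimal sketch. This is not the MV-Topo implementation: the abstract does not specify the encoders, the fusion operator, or the node-merging rule. The names `TopoMap`, `Node`, `fuse`, and `merge_radius` are hypothetical, fusion by concatenation and merging by running average are assumptions made for illustration, and the encoder outputs are stand-in lists of floats.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set, Tuple


@dataclass
class Node:
    position: Tuple[float, float]
    features: List[float]  # fused feature vector stored at this node
    visits: int = 1        # number of observations merged into the node


def fuse(feature_sets: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-encoder vectors by concatenation (an assumed fusion rule)."""
    fused: List[float] = []
    for vec in feature_sets:
        fused.extend(vec)
    return fused


class TopoMap:
    """Hypothetical topological map that is extended as the agent moves."""

    def __init__(self, merge_radius: float = 0.5) -> None:
        self.nodes: List[Node] = []
        self.edges: Set[Tuple[int, int]] = set()
        self.merge_radius = merge_radius  # assumed threshold for reusing a node

    def _nearest(self, position: Tuple[float, float]) -> Optional[int]:
        """Return the index of the closest node within merge_radius, if any."""
        best, best_dist = None, self.merge_radius
        for i, node in enumerate(self.nodes):
            d = math.dist(node.position, position)
            if d <= best_dist:
                best, best_dist = i, d
        return best

    def add_observation(
        self,
        position: Tuple[float, float],
        feature_sets: Sequence[Sequence[float]],
        prev: Optional[int] = None,
    ) -> int:
        """Insert or update a node, link it to the previous node, return its index."""
        fused = fuse(feature_sets)
        idx = self._nearest(position)
        if idx is None:
            self.nodes.append(Node(position, fused))
            idx = len(self.nodes) - 1
        else:
            # Revisited location: keep a running average of the fused features.
            node = self.nodes[idx]
            n = node.visits
            node.features = [(f * n + g) / (n + 1) for f, g in zip(node.features, fused)]
            node.visits += 1
        if prev is not None and prev != idx:
            self.edges.add((min(prev, idx), max(prev, idx)))
        return idx


# Usage: three stand-in "encoder" outputs per observation.
topo = TopoMap(merge_radius=0.5)
a = topo.add_observation((0.0, 0.0), [[1.0], [2.0], [3.0]])
b = topo.add_observation((2.0, 0.0), [[0.0], [0.0], [0.0]], prev=a)
c = topo.add_observation((0.1, 0.0), [[3.0], [2.0], [1.0]], prev=b)
```

In this sketch the third observation falls within `merge_radius` of the first node, so it updates that node instead of creating a new one. A planner would then score the nodes against the instruction to pick a subgoal, a step that is left out here.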