Keywords: Cognitive Map, Vision-and-Language Navigation, Large-Scale Environment
Abstract: Vision-and-Language Navigation (VLN) requires agents to interpret language instructions and navigate to target locations via visual observations. While progress has been made in indoor settings, large-scale outdoor VLN remains underexplored, with environment representation as a primary challenge. Although dense maps (e.g., point clouds) enable flexible environment modeling, their high memory usage fails to meet the real-time requirements of navigation. Additionally, aligning long language instructions with complex environments remains a notable issue. In this paper, we introduce CogVLN, a novel method for VLN in large-scale environments. First, inspired by how humans encode environments, we propose a method for constructing a cognitive map to represent large-scale environments. This method prioritizes encoding key scenes that embody distinct environmental features, while allocating fewer coding resources to scenes with higher consistency. Building on the constructed cognitive map, we design three core functional modules: a localization module that identifies the start and goal vertices, a path planning module that plans navigation routes, and a navigation module that carries out the navigation task. During navigation, driven by a multimodal large language model (MLLM), CogVLN conveys scene information and receives user feedback in an interactive manner, dynamically adjusting its route accordingly. Experimental results in the CARLA Town01 and Town07 environments demonstrate the strong performance of CogVLN.
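As a rough illustration of the map-then-plan pipeline the abstract describes, the sketch below models a cognitive map as a sparse weighted graph whose vertices are "key scenes": an observation is stored as a new vertex only when its feature is sufficiently novel relative to existing vertices, so visually consistent stretches of the environment consume little memory. Localization is nearest-feature matching and planning is Dijkstra over the graph. This is a minimal sketch under assumed design choices; all names, the novelty threshold, and the cosine-similarity matching are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of a cognitive-map navigation pipeline (not CogVLN's code).
import heapq
import numpy as np

class CognitiveMap:
    def __init__(self, novelty_threshold: float = 0.3):
        self.features: list[np.ndarray] = []   # one feature vector per key-scene vertex
        self.edges: dict[int, list[tuple[int, float]]] = {}  # vertex -> [(neighbor, cost)]
        self.tau = novelty_threshold            # assumed novelty threshold

    def add_observation(self, feat: np.ndarray, prev: int | None, cost: float) -> int:
        """Store feat as a new vertex only if sufficiently novel; otherwise
        snap to the best-matching existing vertex (fewer codes for consistent scenes)."""
        v = self.localize(feat)
        if not self.features or 1.0 - self._sim(feat, self.features[v]) > self.tau:
            v = len(self.features)
            self.features.append(feat)
            self.edges[v] = []
        if prev is not None and prev != v:      # connect consecutive key scenes
            self.edges[prev].append((v, cost))
            self.edges[v].append((prev, cost))
        return v

    def localize(self, feat: np.ndarray) -> int:
        """Localization module: return the vertex whose feature best matches feat."""
        if not self.features:
            return 0
        return max(range(len(self.features)),
                   key=lambda i: self._sim(feat, self.features[i]))

    def plan(self, start: int, goal: int) -> list[int]:
        """Path planning module: Dijkstra shortest path over the key-scene graph
        (assumes goal is reachable from start)."""
        dist, prev = {start: 0.0}, {}
        pq = [(0.0, start)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == goal:
                break
            if d > dist.get(u, float("inf")):
                continue
            for v, w in self.edges.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(pq, (d + w, v))
        path, node = [goal], goal
        while node != start:
            node = prev[node]
            path.append(node)
        return path[::-1]

    @staticmethod
    def _sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

In this sketch the navigation module would consume `plan(...)`'s vertex sequence and, as in the abstract, an MLLM would mediate the interactive scene descriptions and user feedback that trigger replanning; that interaction loop is omitted here.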
Primary Area: applications to robotics, autonomy, planning
Submission Number: 10777