Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Memory System

16 Sept 2025 (modified: 03 Jan 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: vision-language navigation, memory system, plug-and-play, embodied intelligence
TL;DR: Mem4Nav boosts urban VLN with a hierarchical memory system that combines a sparse octree and a semantic graph for significantly improved spatial reasoning and recall.
Abstract: Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce **Mem4Nav**, a hierarchical spatial-cognition memory system that can augment most VLN backbones. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing environmental context in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) losslessly compresses historical observations, while short-term memory (STM) caches recent entries for real-time local planning. At each step, the agent retrieves from STM for immediate context or queries LTM to reconstruct deep history as needed. On the Touchdown and Map2Seq benchmarks, Mem4Nav yields substantial performance gains across three distinct backbones (modular, LLM-based, and MLLM-based). Our method improves Task Completion by up to 13.3 percentage points and path fidelity (nDTW) by more than 12 percentage points, while also reducing the final goal distance. Extensive ablation studies confirm that both the hierarchical map and the dual memory modules are indispensable. Our code is available at https://anonymous.4open.science/r/anonymous_Mem4Nav-62B0/.
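The STM-then-LTM lookup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `TwoTierMemory`, the helper `voxel_key`, the LRU eviction policy, and the capacity value are all hypothetical, a plain dictionary keyed by voxel coordinates stands in for the sparse octree, and raw strings stand in for the memory tokens produced by the reversible Transformer.

```python
import math
from collections import OrderedDict
from typing import Dict, Optional, Tuple

Voxel = Tuple[int, int, int]


def voxel_key(position: Tuple[float, float, float], voxel_size: float = 1.0) -> Voxel:
    """Quantize a continuous (x, y, z) position to integer voxel coordinates."""
    x, y, z = position
    return (
        math.floor(x / voxel_size),
        math.floor(y / voxel_size),
        math.floor(z / voxel_size),
    )


class TwoTierMemory:
    """Hypothetical two-tier store: a small recent cache (STM) over a full history (LTM)."""

    def __init__(self, stm_capacity: int = 3) -> None:
        # STM: bounded cache of recent entries (LRU eviction is an assumption).
        self.stm: "OrderedDict[Voxel, str]" = OrderedDict()
        # LTM: sparse map holding only visited voxels (stand-in for an octree).
        self.ltm: Dict[Voxel, str] = {}
        self.stm_capacity = stm_capacity

    def write(self, position: Tuple[float, float, float], observation: str) -> None:
        key = voxel_key(position)
        self.ltm[key] = observation       # every observation is kept long-term
        self.stm[key] = observation       # and cached for fast local access
        self.stm.move_to_end(key)
        while len(self.stm) > self.stm_capacity:
            self.stm.popitem(last=False)  # drop the least recently used entry

    def read(self, position: Tuple[float, float, float]) -> Tuple[str, Optional[str]]:
        """Return (source, observation); source is 'stm', 'ltm', or 'miss'."""
        key = voxel_key(position)
        if key in self.stm:
            self.stm.move_to_end(key)
            return "stm", self.stm[key]
        if key in self.ltm:
            return "ltm", self.ltm[key]
        return "miss", None


memory = TwoTierMemory(stm_capacity=2)
memory.write((0.2, 0.1, 0.0), "red awning on the corner")
memory.write((5.5, 0.0, 0.0), "bus stop")
memory.write((9.1, 0.0, 0.0), "intersection with traffic light")

recent = memory.read((9.9, 0.0, 0.0))   # still cached in STM
older = memory.read((0.4, 0.9, 0.3))    # evicted from STM, recovered from LTM
unseen = memory.read((100.0, 0.0, 0.0)) # never visited
```

In this sketch, a lookup only falls back to the long-term store when the recent cache misses, which mirrors the abstract's split between immediate context and deep history without modeling the semantic graph or learned compression.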
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7410