From End-to-End to Step-by-Step: Learning Composable Navigation Primitives for Vision-Language Navigation
Keywords: embodied ai, vision language navigation, multimodal large language models
TL;DR: Current VLN systems lack basic navigation skills; we propose a step-by-step training paradigm to equip MLLMs with reliable foundational capabilities for VLN tasks.
Abstract: Recent Vision-Language Navigation (VLN) research with Multi-modal Large Language Models (MLLMs) has broadly adopted end-to-end training on long-horizon instruction datasets. However, human navigation mainly relies on the sequential execution of simple primitives guided by immediate observations. Our analysis shows that, although VLN models report promising results on long-horizon instructions, they struggle with basic navigation primitives (e.g., move, change region). To the best of our knowledge, we are the first to point out this phenomenon. To address it, we propose a primitive-based paradigm that first learns core skills and then composes them into long-horizon behaviors. We design a unified data pipeline to construct Vision-Language-Move-Base (VLMB), the first controllable benchmark centered on the move-to primitive, covering 206 scenes and 873 object instances. Based on VLMB, we develop Move-to-Anything, a model equipped with a memory mechanism that balances historical context with current observations. Experiments show that existing VLN models achieve only a 43.8% success rate on MP3D, whereas our approach reaches 60.6% on MP3D and 71.4% on HM3D and exhibits substantially stronger compositional generalization. These results highlight the effectiveness of primitive-based learning for building robust and generalizable navigation agents.
Supplementary Material: pdf
Primary Area: applications to robotics, autonomy, planning
Submission Number: 5376