Keywords: VLN
Abstract: Vision-and-Language Navigation (VLN) tasks require an agent to interpret natural language instructions to navigate complex environments. However, existing methods struggle to ground multi-step instructions to intermediate visual observations. Typically, these approaches feed the entire instruction to the model at once, forcing it to implicitly align visual observations with specific instruction segments and complicating the grounding process. To mitigate this issue, we introduce Sub-Aligner, a novel sub-instruction index prediction module designed to explicitly identify the instruction segment most relevant to the agent's current visual observations. Additionally, we propose a dual-stage, scene-aware description module that summarizes the agent's surroundings from both directional and panoramic perspectives, effectively bridging the semantic gap between visual context and complex, multi-step language instructions. Empirical evaluations demonstrate that integrating Sub-Aligner consistently enhances navigation performance across different VLN agents on the Room-to-Room (R2R) and Room-for-Room (R4R) benchmarks.
Submission Type: Short Research Paper (< 4 Pages)
Submission Number: 89
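To make the abstract's core idea concrete, below is a minimal, hypothetical sketch (not the authors' code) of a sub-instruction index predictor in PyTorch: it scores each sub-instruction segment against the current visual observation and treats the argmax as the predicted index. All names (SubAligner, hidden_dim, the pooled-feature inputs) are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a sub-instruction index predictor; names are illustrative.
import torch
import torch.nn as nn


class SubAligner(nn.Module):
    """Scores each sub-instruction against the current visual observation
    and predicts the index of the most relevant segment."""

    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(hidden_dim, hidden_dim)
        self.txt_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, sub_instr: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        # sub_instr: (batch, num_subs, hidden_dim) pooled sub-instruction features
        # obs:       (batch, hidden_dim) pooled visual observation feature
        q = self.vis_proj(obs).unsqueeze(1)             # (batch, 1, hidden_dim)
        k = self.txt_proj(sub_instr)                    # (batch, num_subs, hidden_dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5    # (batch, num_subs) similarity logits
        return scores                                   # argmax over num_subs gives the index


# Usage: logits = SubAligner()(sub_instr_feats, obs_feat); idx = logits.argmax(-1)
```

In practice such logits could be trained with a cross-entropy loss against annotated sub-instruction indices and used to condition the agent's policy on the selected segment; the paper's actual training objective and integration are not specified here.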