Action-Aware Visual-Textual Alignment for Long-Instruction Vision-and-Language Navigation
Abstract: Traditional Vision-and-Language Navigation (VLN) requires an agent to navigate to a target location based solely on visual observations, guided by natural language instructions. Compared with this task, long-instruction VLN involves longer instructions, extended trajectories, and the need to consider more contextual information for global path planning. It is therefore more challenging and requires accurately aligning the instructions with the agent's current visual observations, which raises two significant issues. First, actions are misaligned: the agent's visual observations at each step lack explicit action-related details, while the instructions contain action-oriented words. Second, global instructions are misaligned with local visual observations: the instructions describe the entire navigation trajectory, whereas the agent's visual observations only provide …