Abstract: Traditional Vision-and-Language Navigation (VLN) requires an agent to reach a target location guided by natural language instructions, relying solely on its visual observations. Long-instruction VLN extends this task with longer instructions, extended trajectories, and the need to consider more contextual information for global path planning. It is therefore more challenging and requires accurately aligning the instructions with the agent's current visual observations, which raises two significant issues. First, there is a misalignment at the action level: the agent's visual observations at each step lack explicit action-related details, while the instructions contain action-oriented words. Second, there is a misalignment between global instructions and local visual observations: the instructions describe the entire navigation trajectory, whereas the agent's visual observations provide only localized information about a specific position along it. To address these issues, this article introduces the Action-Perception Alignment Framework (APAF). Within this framework, we first design the Action-Contextual Encoding Module (ACEM), which enriches the agent's visual perception by encoding potential actions with relative heading and elevation angles. We then propose the Dynamic Instruction Weighting Module (DIWM), which adjusts the importance of instruction words based on the agent's current visual observations, emphasizing the words most relevant to what the agent currently sees. Our approach significantly outperforms existing methods, achieving state-of-the-art results with Success Rate (SR) improvements of 8.5% and 4.0% on the long-instruction R4R and RxR datasets, respectively.
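The abstract only names the two modules, so the sketch below illustrates, under stated assumptions, what such components could look like: angle-aware encoding of candidate actions (cf. ACEM) and observation-conditioned reweighting of instruction words (cf. DIWM). This is a minimal PyTorch-style sketch; the class names, tensor shapes, and the single-query attention step are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the two alignment ideas described in the abstract.
# All module names, shapes, and hyperparameters are assumptions for
# illustration, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionContextualEncoding(nn.Module):
    """Append relative heading/elevation features to each candidate view,
    so the visual stream carries explicit action-related cues (cf. ACEM)."""

    def __init__(self, visual_dim: int, angle_dim: int = 128):
        super().__init__()
        # 4 raw angle features: sin/cos of relative heading and elevation.
        self.angle_proj = nn.Linear(4, angle_dim)
        self.fuse = nn.Linear(visual_dim + angle_dim, visual_dim)

    def forward(self, cand_feats, rel_heading, rel_elevation):
        # cand_feats: (B, K, visual_dim) features of K candidate actions
        # rel_heading, rel_elevation: (B, K) angles relative to agent pose
        angles = torch.stack(
            [rel_heading.sin(), rel_heading.cos(),
             rel_elevation.sin(), rel_elevation.cos()], dim=-1)  # (B, K, 4)
        angle_emb = self.angle_proj(angles)
        return self.fuse(torch.cat([cand_feats, angle_emb], dim=-1))


class DynamicInstructionWeighting(nn.Module):
    """Reweight instruction tokens by their relevance to the current
    visual observation (cf. DIWM), via one attention step."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # visual state -> query
        self.key = nn.Linear(dim, dim)    # instruction tokens -> keys

    def forward(self, word_feats, vis_state, pad_mask=None):
        # word_feats: (B, L, dim) encoded instruction tokens
        # vis_state:  (B, dim) summary of the current observation
        q = self.query(vis_state).unsqueeze(1)        # (B, 1, dim)
        k = self.key(word_feats)                      # (B, L, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5  # (B, L)
        if pad_mask is not None:
            scores = scores.masked_fill(pad_mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)           # per-word importance
        # Emphasize words most relevant to what the agent currently sees.
        return word_feats * weights.unsqueeze(-1), weights
```

In this reading, ACEM injects action information into perception while DIWM conditions the instruction side on perception, so the two sketches attack the action-level and global-vs-local misalignments, respectively.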
External IDs: doi:10.1145/3748656