Action-Aware Visual-Textual Alignment for Long-Instruction Vision-and-Language Navigation
Abstract: Traditional Vision-and-Language Navigation (VLN) requires an agent to navigate to a target location based solely on visual observations, guided by natural language instructions. Compared with this task, long-instruction VLN involves longer instructions, extended trajectories, and the need to consider more contextual information for global path planning. It is therefore more challenging and requires accurately aligning the instructions with the agent's current visual observations, which raises two significant issues. First, actions are misaligned: the agent's visual observations at each step lack explicit action-related details, while the instructions contain action-oriented words. Second, global instructions are misaligned with local visual observations: the instructions describe the entire navigation trajectory, whereas the agent's visual observations only provide …