Abstract: Vision-and-language Navigation (VLN) is a challenging problem that requires agents to follow natural language instructions in a photo-realistic environment. Aligning the visual object information an agent observes with the object information in the instruction is critical for its navigational capability. However, most reinforcement learning policies use only the change in the agent’s distance to the target viewpoint as the direct reward after each action, so object information plays only a minor role in classical reinforcement learning for VLN. To address this limitation, we construct a new reward-shaping scheme that incorporates both the change in the agent’s distance to the target and the progress made in following the given instruction. To capture this navigation progress, we propose an object alignment method that matches the visual objects observed by the agent against the objects specified in the instruction. By leveraging each object’s position within the navigation instruction, we estimate the agent’s approximate progress during navigation. Experimental results demonstrate the effectiveness of our approach in reducing navigation error (NE) and achieving strong performance on success rate weighted by path length (SPL). Our method significantly enhances the agent’s ability to accurately follow natural language instructions to the intended destination, while also generalizing better to unseen environments.
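The reward shaping described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the function names, the set-membership object matching, and the weights `w_dist` and `w_prog` are all assumptions introduced here for clarity.

```python
def object_progress(observed_objects, instruction_objects):
    """Estimate navigation progress as the relative position, within the
    instruction, of the latest instruction object matched by an observed
    object. Returns a value in [0, 1]. (Hypothetical alignment rule:
    exact string match stands in for the paper's visual-textual alignment.)"""
    matched_positions = [
        idx for idx, obj in enumerate(instruction_objects)
        if obj in observed_objects
    ]
    if not matched_positions:
        return 0.0
    return (max(matched_positions) + 1) / len(instruction_objects)


def shaped_reward(dist_prev, dist_curr, progress_prev, progress_curr,
                  w_dist=1.0, w_prog=1.0):
    """Per-step shaped reward: weighted reduction in distance to the target
    plus weighted gain in instruction-following progress. Weights are
    illustrative, not values from the paper."""
    return (w_dist * (dist_prev - dist_curr)
            + w_prog * (progress_curr - progress_prev))
```

For example, an action that moves the agent 1.0 m closer to the target while the matched objects advance from the start of the instruction to its midpoint would receive a reward of `1.0 + 0.5 = 1.5` under unit weights.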