Abstract: Embodied navigation is a fundamental capability that enables embodied agents to effectively interact with the physical world in various complex environments.
However, a significant gap remains between current embodied navigation tasks and real-world requirements: existing methods often struggle to integrate high-level human instructions with spatial understanding, which is essential for agents to perceive their surroundings, adapt to intricate layouts, and make decisions grounded in spatial relationships. To address this gap, we propose a new embodied navigation task, spatial navigation, comprising two components: spatial object navigation (SpON), which guides agents to specific objects by leveraging spatial relationships and contextual understanding, and spatial area navigation (SpAN), which directs agents to designated areas within complex environments. Together, these components substantially enhance agents' navigation capabilities, enabling more effective interaction in real-world scenarios. To support this task, we generate a spatial navigation dataset of 10,000 trajectories in the AI2THOR simulator, with 5,000 trajectories allocated to each component. The dataset pairs high-level human instructions with detailed observations and corresponding navigation actions, providing diverse scenarios and rich contextual information for training adaptable embodied agents in complex environments. Building on this dataset, we introduce SpNav, a hierarchical navigation framework designed around the principle of "What You See Is What You Reach." SpNav employs a vision-language model (VLM) to interpret high-level human instructions and identify target objects or areas within the observation range, then performs precise point-to-point navigation using a spatial map to complete the spatial navigation task.
This framework enhances the agent's ability to operate effectively in complex environments, bridging the gap between perception and action. Extensive experiments demonstrate that SpNav achieves state-of-the-art performance on spatial navigation tasks, surpassing all baseline methods, and exhibits strong zero-shot simulation-to-reality transfer, highlighting its potential for real-world deployment in embodied AI. To support ongoing research in this field, we will release the dataset, benchmark, and source code.