HiNav: A Human-Inspired Framework for Zero-Shot Vision-and-Language Navigation

ACL ARR 2025 May Submission 3790 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate complex environments. Existing approaches often fail to stop at the target because they misrecognize the endpoint, or fail to reach it at all in long-distance tasks. Inspired by how humans navigate, we propose Human-Inspired Navigation (HiNav), a modular framework that mimics human cognitive processes for efficient navigation. HiNav integrates four components that emulate key human abilities: HiView for optimal viewpoint selection; HiMem for selective memory and map maintenance, enhancing long-range exploration; HiSpace for spatial reasoning and object-relationship inference, improving endpoint recognition; and HiDecision for Large Language Model (LLM)-based path planning. We also introduce an Instruction-Object-Space (I-O-S) dataset and fine-tune Qwen3-4B into Qwen-Spatial (Qwen-Sp), which outperforms leading commercial LLMs (e.g., GPT-4o, Gemini 2.5 Flash, Grok 3) at object-list extraction, achieving higher F1 and NDCG scores on the I-O-S test set. Extensive experiments on the Room-to-Room (R2R) and REVERIE datasets demonstrate state-of-the-art performance, with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL).
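The abstract names the four modules but not their interfaces. As a rough illustration of how such a modular loop could be wired together, here is a minimal Python sketch; every class, method signature, and data structure below (Viewpoint, select, remember, is_endpoint, navigate) is an assumption made for exposition, not the authors' implementation, and the greedy loop stands in for the LLM-based HiDecision planner.

```python
# Hypothetical sketch of a four-module navigation loop in the spirit of
# HiNav. All interfaces here are illustrative assumptions; the paper's
# actual APIs are not described in the abstract.
from dataclasses import dataclass, field


@dataclass
class Viewpoint:
    vid: str
    objects: list[str]                 # objects visible from this viewpoint
    neighbors: list["Viewpoint"] = field(default_factory=list)


class HiView:
    """Viewpoint selection: score candidates against the instruction text."""
    def select(self, instruction: str, candidates: list[Viewpoint]) -> Viewpoint:
        mentioned = instruction.lower()
        return max(candidates, key=lambda v: sum(o in mentioned for o in v.objects))


@dataclass
class HiMem:
    """Selective memory: keep a compact trail of visited viewpoints."""
    trail: list[str] = field(default_factory=list)

    def remember(self, v: Viewpoint) -> None:
        if v.vid not in self.trail:
            self.trail.append(v.vid)


class HiSpace:
    """Spatial-reasoning stand-in: flag the endpoint when target objects appear."""
    def is_endpoint(self, targets: list[str], v: Viewpoint) -> bool:
        return all(t in v.objects for t in targets)


def navigate(instruction: str, targets: list[str], start: Viewpoint,
             max_steps: int = 20) -> tuple[Viewpoint, list[str]]:
    """HiDecision stand-in: a greedy loop in place of an LLM planner."""
    view, mem, space = HiView(), HiMem(), HiSpace()
    current = start
    for _ in range(max_steps):
        mem.remember(current)
        if space.is_endpoint(targets, current):
            return current, mem.trail      # stop at the recognized endpoint
        unexplored = [n for n in current.neighbors if n.vid not in mem.trail]
        if not unexplored:
            break                          # dead end: stop where we are
        current = view.select(instruction, unexplored)
    return current, mem.trail


goal = Viewpoint("v2", ["lamp", "sofa"])
mid = Viewpoint("v1", ["hallway"], neighbors=[goal])
start = Viewpoint("v0", ["door"], neighbors=[mid])
end, trail = navigate("walk past the hallway to the sofa", ["sofa"], start)
print(end.vid, trail)                      # v2 ['v0', 'v1', 'v2']
```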
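The evaluation compares object-list extraction by F1 and NDCG. As a concrete illustration of how the two scores behave on a single prediction, here is a small self-contained computation; the gold and predicted lists are invented, and the binary-relevance NDCG formulation is an assumption, since the abstract does not specify the exact variant used on the I-O-S test set.

```python
# Worked example of the two reported metrics for object-list extraction.
# The lists below are invented for illustration, not from the I-O-S dataset.
import math


def f1_score(pred: list[str], gold: list[str]) -> float:
    """Set-level F1 between predicted and gold object lists."""
    tp = len(set(pred) & set(gold))
    if tp == 0:
        return 0.0
    precision = tp / len(set(pred))
    recall = tp / len(set(gold))
    return 2 * precision * recall / (precision + recall)


def ndcg(pred: list[str], gold: list[str]) -> float:
    """NDCG with binary relevance: gold objects ranked early score higher."""
    rel = [1.0 if obj in gold else 0.0 for obj in pred]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), len(pred))))
    return dcg / ideal if ideal else 0.0


gold = ["sofa", "lamp", "coffee table"]
pred = ["sofa", "coffee table", "plant"]     # one miss, one spurious object
print(f"F1   = {f1_score(pred, gold):.3f}")  # 0.667
print(f"NDCG = {ndcg(pred, gold):.3f}")      # 0.704
```

Note how the two metrics disagree on the same prediction: F1 only counts set overlap, while NDCG also rewards placing gold objects early in the ranked list, which is why both are reported.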
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, large language models
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: vision-and-language navigation, human-inspired navigation, large language models
Submission Number: 3790