HuLE-Nav: Human-Like Exploration for Zero-Shot Object Navigation via Vision-Language Models

ACL ARR 2025 February Submission7934 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

Enabling robots to navigate efficiently in unknown environments is a key challenge in embodied intelligence. Human exploration relies on accumulated knowledge, spatio-temporal memory, and semantic scene understanding. Inspired by these principles, we propose HuLE-Nav, a zero-shot object navigation method with two core components: multi-dimensional semantic value maps that emulate human-like memory retention, and active exploration mechanisms that mimic human exploration behavior. Specifically, HuLE-Nav uses Vision-Language Models (VLMs) and real-time observations to dynamically capture semantic relationships between objects, scene semantics, and spatio-temporal exploration history. This information is represented and iteratively updated in the multi-dimensional semantic value maps. Using these maps, HuLE-Nav employs active exploration mechanisms that integrate dynamic exploration, replanning, collision avoidance, and target verification, enabling flexible long-term goal selection and real-time adaptation of navigation strategies. Experimental results on the challenging HM3D and Gibson datasets show that HuLE-Nav outperforms the strongest existing methods in both success rate and exploration efficiency.
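
To make the abstract's pipeline concrete, the sketch below illustrates one plausible reading of a multi-dimensional semantic value map with long-term goal selection. It is not the authors' implementation: the grid resolution, the per-cell channels (object relevance, scene semantics, visit history), the exponential-moving-average fusion, and the weighted goal-scoring function are all illustrative assumptions, and any VLM scoring of observations is assumed to happen upstream.

```python
# Minimal sketch (not the authors' implementation) of a multi-dimensional
# semantic value map and long-term goal selection. Assumptions:
#   - a fixed-size 2D grid map,
#   - per-cell channels for object relevance, scene semantics, and visit history,
#   - VLM-derived relevance scores in [0, 1] are supplied by the caller.
import numpy as np

GRID = 64  # map resolution (cells per side); illustrative value

class SemanticValueMap:
    def __init__(self, channels=("object", "scene", "history")):
        self.channels = {c: np.zeros((GRID, GRID), dtype=np.float32) for c in channels}

    def update(self, cell, object_score, scene_score, decay=0.9):
        """Fuse a new VLM-scored observation into the map at `cell` (row, col)."""
        r, c = cell
        # Exponential moving average emulates memory retention of past observations.
        self.channels["object"][r, c] = decay * self.channels["object"][r, c] + (1 - decay) * object_score
        self.channels["scene"][r, c] = decay * self.channels["scene"][r, c] + (1 - decay) * scene_score
        self.channels["history"][r, c] += 1.0  # spatio-temporal exploration count

    def select_goal(self, frontier_cells, w_obj=1.0, w_scene=0.5, w_hist=0.3):
        """Pick the frontier cell with the highest combined semantic value,
        penalising already-explored regions."""
        def value(cell):
            r, c = cell
            return (w_obj * self.channels["object"][r, c]
                    + w_scene * self.channels["scene"][r, c]
                    - w_hist * self.channels["history"][r, c])
        return max(frontier_cells, key=value)

# Usage: fuse one observation, then choose the next long-term goal.
vmap = SemanticValueMap()
vmap.update(cell=(10, 12), object_score=0.8, scene_score=0.6)
goal = vmap.select_goal(frontier_cells=[(10, 12), (30, 40)])
print("next long-term goal cell:", goal)
```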

Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, vision question answering, multimodality
Languages Studied: English
Submission Number: 7934