CL-HCoTNav: Closed-Loop Hierarchical Chain-of-Thought for Zero-Shot Object-Goal Navigation with Vision-Language Models

RSS 2025 Workshop EgoAct Submission 2 Authors

Published: 29 Apr 2025 (modified: 10 Jun 2025) · RSS 2025 Workshop EgoAct Submission · License: CC BY 4.0
Keywords: Vision-based navigation, Vision-Language Models, Autonomous agents
TL;DR: We propose CL-HCoTNav, a VLM-driven ObjectNav framework with hierarchical reasoning and closed-loop feedback, achieving 22.4% better zero-shot navigation performance on unseen scenes and objects in AI Habitat.
Abstract: Visual Object Goal Navigation (ObjectNav) requires a robot to locate and navigate to a target object using egocentric observations. However, generalizing policy behavior to new settings (unseen environments and novel target objects) remains a significant challenge. Traditional end-to-end learning methods exacerbate this issue: they rely on memorized latent patterns rather than structured reasoning, which limits their ability to generalize. While some recent approaches leverage foundation models for enhanced reasoning, they often overlook the inherent uncertainty and potential errors in vision-language model (VLM) outputs and lack mechanisms to detect and correct mistakes during navigation. In this work, we introduce Closed-Loop Hierarchical Chain-of-Thought Navigation (CL-HCoTNav), a VLM-driven ObjectNav framework that integrates structured reasoning and closed-loop feedback into navigation policy learning. To improve generalization, we fine-tune a small-scale pre-trained VLM on multi-turn question-answering (QA) data derived from human demonstration trajectories. This structured dataset enables hierarchical Chain-of-Thought (H-CoT) prompting, which systematically extracts compositional knowledge by mirroring the iterative reasoning steps a human follows when locating a target object. In addition, we propose a Closed-Loop H-CoT mechanism that incorporates quantifiable detection and reasoning confidence scores into the training loss. This adaptive weighting strategy guides the model to prioritize high-confidence data pairs during navigation, reducing noise from observations and improving robustness against hallucinated or incorrect reasoning. Extensive experiments in the AI Habitat simulator demonstrate that CL-HCoTNav achieves superior generalization to unseen scenes and novel object categories, outperforming state-of-the-art approaches in ObjectNav success rate (SR) and success weighted by path length (SPL) by 22.4%.
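To make the hierarchical multi-turn QA structure concrete, below is a minimal sketch of what one H-CoT training example could look like. The turn ordering (scene description, target relevance, subgoal, low-level action) follows the iterative human reasoning process the abstract describes, but the field names, question wording, and action vocabulary are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical multi-turn QA example for hierarchical Chain-of-Thought
# (H-CoT) prompting. The turn hierarchy (scene -> relevance -> subgoal ->
# action) mirrors the iterative reasoning described in the abstract; the
# exact schema and phrasing here are assumptions for illustration only.
hcot_example = [
    {"role": "user",
     "content": "What objects and rooms are visible in the current egocentric view?"},
    {"role": "assistant",
     "content": "A corridor leading into a kitchen; a countertop and a refrigerator are visible."},
    {"role": "user",
     "content": "Given the target object 'mug', which visible region is most likely to contain it?"},
    {"role": "assistant",
     "content": "The kitchen countertop: mugs are commonly placed near sinks and coffee makers."},
    {"role": "user",
     "content": "What intermediate subgoal should the robot pursue next?"},
    {"role": "assistant",
     "content": "Navigate toward the kitchen countertop."},
    {"role": "user",
     "content": "What low-level action should be executed now?"},
    {"role": "assistant",
     "content": "move_forward"},
]
```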
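The closed-loop mechanism folds detection and reasoning confidence into the fine-tuning objective through adaptive weighting. The abstract does not give the exact formula, so the following is a minimal sketch under the assumption that each QA pair carries two scalar confidences in [0, 1] that are blended into a per-sample weight on a cross-entropy loss; the function name, the blending coefficient `alpha`, and the normalization are all hypothetical.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, targets, det_conf, reason_conf, alpha=0.5):
    """Cross-entropy loss adaptively weighted by per-sample confidences.

    Hypothetical form: the paper states only that detection and reasoning
    confidence scores enter the training loss via adaptive weighting.
    det_conf and reason_conf are tensors of shape (batch,) in [0, 1];
    alpha balances detection vs. reasoning confidence.
    """
    # Unreduced cross-entropy, one loss value per sample.
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # Blend the two confidence signals into a single weight per sample.
    weights = alpha * det_conf + (1.0 - alpha) * reason_conf
    # Normalize so the weights sum to 1 across the batch.
    weights = weights / (weights.sum() + 1e-8)
    return (weights * per_sample).sum()
```

Under this weighting, low-confidence pairs (uncertain detections or possibly hallucinated reasoning steps) contribute less gradient signal, matching the stated goal of prioritizing high-confidence data.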
Submission Number: 2