Where to Fuse in the VLM Era: A Survey on Integrating Knowledge into Object Goal Navigation

Published: 08 Oct 2025, Last Modified: 08 Oct 2025 · HEAI 25 Poster · CC BY 4.0
Keywords: Object Goal Navigation, Vision-Language Model, Large Language Model, Socially Interactive Navigation
Abstract: Rapid advances in robotics and deep learning have accelerated the adoption of Embodied AI, where robots autonomously explore and reason in complex real-world environments. With growing demand for domestic service robots, efficient navigation in unfamiliar settings is crucial. Object Goal Navigation (ObjectNav) is a fundamental task for this capability, requiring a robot to find and reach a user-specified object in an unknown environment. Solving ObjectNav demands advanced perception, contextual reasoning, and effective exploration strategies. Recent Vision-Language Models (VLMs) and Large Language Models (LLMs) equip agents with external common-sense knowledge and reasoning capabilities. This paper poses the critical question: “Where should VLM/LLM knowledge be fused into Object Goal Navigation?” Adapting the Perception–Prediction–Planning paradigm from autonomous driving, we categorize knowledge integration into these three stages, offering a structured survey of ObjectNav approaches shaped by the VLM era. We conclude by discussing current dataset limitations and future directions, such as socially interactive navigation.
Submission Number: 8