OpenVLM-Nav: Training-Free Zero-Shot Object-Goal Navigation via Vision-Language Guidance

17 Nov 2025 (modified: 29 Dec 2025) · ICC 2025 Workshop RAS Submission · CC BY 4.0
Keywords: Zero-Shot Navigation, Object-Goal Navigation, Embodied AI, Habitat
TL;DR: OpenVLM-Nav explores the capability of open-source VLMs (without domain knowledge) for zero-shot object-goal navigation through prompt fine-tuning.
Abstract: We propose OpenVLM-Nav, a training-free framework for zero-shot object-goal navigation using open-source vision–language models. Using CLIP\cite{xu2021videoclip}, BLIP\cite{li2022blip}, and Qwen3-VL-2B\cite{yang2025qwen3}, the agent interprets object descriptions directly from images without task-specific training. Qwen3-VL-2B performs best, and we further study two extensions: a history module for temporal context and a depth module for geometric cues. Depth provides the largest gain, improving the success rate from 0.08 to 0.14 and reducing the Distance-to-Goal from 7.824 to 7.567. History gives smaller but consistent improvements. These results show that simple, training-free VLM-based navigation can be strengthened through temporal reasoning and depth information. The code will be made publicly available.
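For illustration, the sketch below shows what the core training-free step could look like: prompting an open-source VLM with the current RGB observation and the goal category to obtain a discrete navigation action. This is not the authors' released code; the Hugging Face `transformers` Qwen2-VL classes, the checkpoint name, the prompt wording, and the move_forward/turn_left/turn_right/stop action set are all illustrative assumptions (the paper's Qwen3-VL-2B model would be loaded analogously).

```python
# Hypothetical sketch of a training-free VLM query for object-goal navigation.
# Assumes the Qwen2-VL-2B-Instruct checkpoint as a stand-in for Qwen3-VL-2B.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # stand-in checkpoint name (assumption)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)


def suggest_action(rgb: Image.Image, goal: str) -> str:
    """Ask the VLM which discrete action moves the agent toward the goal object."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": (f"You are a robot searching for a {goal}. "
                      "Based on the image, answer with exactly one of: "
                      "move_forward, turn_left, turn_right, stop.")},
        ],
    }]
    # Render the chat template, then tokenize text and image together.
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], images=[rgb], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=8)
    # Decode only the newly generated tokens (the model's answer).
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()


# Example usage with a hypothetical observation file:
# action = suggest_action(Image.open("obs_rgb.png"), "chair")
```

In such a setup, the history and depth modules described in the abstract could be added purely at the prompt level, e.g. by appending a summary of recent actions or coarse depth statistics to the text prompt, keeping the pipeline training-free.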
Submission Number: 10