Keywords: Vision-based Navigation, Deep Learning
Abstract: Open-world navigation requires robots to make decisions in complex, dynamic environments and adapt to flexible task requirements. Traditional approaches often rely on hand-crafted goal metrics and struggle to generalize beyond specific tasks. Recent advances in vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but they typically require interactive training or large-scale data collection with a mobile agent. We frame navigation as a discrete sub-goal identification problem and extend our previous work, FrontierNet, a learning-based exploration system that detects and localizes frontiers directly from visual cues. We integrate FrontierNet with pre-trained vision-language models (VLMs) through a set-of-mark prompting strategy, enabling zero-shot, general-purpose navigation directly from natural language instructions. FrontierNet achieves state-of-the-art performance in autonomous exploration and, when combined with a VLM, demonstrates zero-shot adaptation across a variety of semantic tasks, such as object search, without requiring any additional training or map updates.
Submission Number: 7