Where Am I? Exploring the Situational Awareness Capability of Vision-Language Models in Vision-and-Language Navigation
Abstract: Intuitively, it is important for humans to localize themselves by understanding their surroundings when navigating to a destination, especially when the trajectory is long and complex. We believe that this kind of capability, which we call situational awareness, is likewise crucial for building better navigation agents. This work evaluates the situational awareness capability of popular vision-language model (VLM)-based navigation agents.
Inspired by the way humans process observations, we consider two types of visual inputs to the models: 360-degree panoramic images and egocentric navigation videos. We construct a new dataset, the \emph{Situational Awareness Dataset (SAD)}, comprising around 100K such panoramic images and videos with corresponding instructions for this task. We then evaluate several prominent VLMs, including OpenAI o1, GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL, as well as their finetuned versions, on SAD.
Our results show that the situational awareness capability of these models falls far behind human performance but can be significantly improved by further finetuning. Our findings also suggest that fine-grained alignment between observations and instructions is highly beneficial for the vision-and-language navigation (VLN) task, an aspect that has so far been somewhat overlooked by the community.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, vision question answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English, Hindi, Telugu
Submission Number: 6534