Where Am I? Exploring the Situational Awareness Capability of Vision-Language Models in Vision-and-Language Navigation
Abstract: Intuitively, it is important for humans to localize themselves by understanding their surroundings when navigating to a destination, especially when the trajectory is long and complex. We believe that this kind of capability, which we call situational awareness, is likewise crucial for developing better navigation agents. This work explores the situational awareness capability of current popular vision-language model (VLM)-based navigation agents in the context of vision-and-language navigation (VLN). We contribute a new dataset, the \emph{Situational Awareness Dataset (SAD)}, comprising around 100K 360-degree panoramic images and corresponding instructions for this task. We then evaluate multiple prominent VLMs, including OpenAI o1, GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL, on this dataset. Our results show that the situational awareness capability of these models falls far behind human performance, highlighting substantial room for improvement in this area. We hope that this work will spark future research to improve navigation agents and VLMs, particularly in their ability to process panoramic image data effectively.
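To make the evaluation setup concrete, below is a minimal sketch (not the authors' evaluation code) of how one might probe a VLM's situational awareness on a single SAD-style example: a 360-degree panoramic image paired with a navigation instruction. The prompt wording, the "panorama.jpg" path, and the question format are illustrative assumptions; only the OpenAI Python SDK calls are real.

```python
# Hedged sketch: querying GPT-4o with a panoramic view and a navigation
# instruction, asking which step of the instruction the agent has reached.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read an image file and return its base64 encoding for the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


panorama_b64 = encode_image("panorama.jpg")  # hypothetical panoramic image
instruction = (  # hypothetical VLN-style instruction
    "Walk past the kitchen island, turn left into the hallway, "
    "and stop in front of the bathroom door."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "You are a navigation agent following this instruction: "
                        f"{instruction}\n"
                        "Given the panoramic view below, which step of the "
                        "instruction have you reached? Answer briefly."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{panorama_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

In an actual evaluation one would compare such responses against ground-truth localization labels over the full dataset; the snippet above only illustrates the single-example query pattern the abstract implies.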
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: situational awareness, vision-language model, vision-language navigation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English, Hindi, Telugu
Submission Number: 8126