Where Am I? Exploring the Situational Awareness Capability of Vision-Language Models in Vision-and-Language Navigation
Abstract: Intuitively, it is important for humans to localize themselves by understanding their surroundings when navigating to a destination, especially when the trajectory is long and complex. We believe that this kind of capability, which we call situational awareness, is likewise crucial for building better navigation agents. This work evaluates the situational awareness capability of popular vision-language model (VLM)-based navigation agents.
Inspired by the way humans process observations, we consider two types of visual inputs to the models: 360-degree panoramic images and egocentric navigation videos. We construct a new dataset, the \emph{Situational Awareness Dataset (SAD)}, comprising around 100K such panoramic images and videos with corresponding instructions for this task. We then evaluate several prominent VLMs, including OpenAI o1, GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL, as well as their finetuned versions, on SAD.
Our results show that the situational awareness capability of these models falls far behind human performance but can be significantly improved by further finetuning. Our findings also suggest that fine-grained alignment between observations and instructions is highly beneficial for the vision-and-language navigation (VLN) task, an aspect that has so far been somewhat overlooked by the community.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, vision question answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English, Hindi, Telugu
Submission Number: 6534