Where Am I? Exploring the Situational Awareness Capability of Vision-Language Models in Vision-and-Language Navigation
Abstract: Intuitively, it is important for humans to localize themselves by understanding their surroundings when navigating to a destination, especially when the trajectory is long and complex. We believe that this kind of capability, which we call situational awareness, is likewise crucial for developing better navigation agents. This work explores the situational awareness capability of current popular vision-language model (VLM)-based navigation agents in the context of vision-and-language navigation (VLN). We contribute a new dataset, the \emph{Situational Awareness Dataset (SAD)}, comprising around 100K 360-degree panoramic images and corresponding instructions for this task. We then evaluate multiple prominent VLMs, including OpenAI o1, GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL, on this dataset. Our results show that the situational awareness capability of these models falls far behind human performance, highlighting substantial room for improvement in this area. We hope that this work will spark future research to improve navigation agents and VLMs, particularly in their ability to process panoramic image data effectively.
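To make the evaluation setup concrete, below is a minimal sketch (not the authors' evaluation code) of how one might probe a VLM's situational awareness on a single SAD-style example: a 360-degree panoramic image paired with a navigation instruction. The prompt wording, the "panorama.jpg" path, and the question format are illustrative assumptions; only the OpenAI Python SDK calls are real.

```python
# Hedged sketch: querying GPT-4o with a panoramic view and a navigation
# instruction, asking which step of the instruction the agent has reached.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read an image file and return its base64 encoding for the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


panorama_b64 = encode_image("panorama.jpg")  # hypothetical panoramic image
instruction = (  # hypothetical VLN-style instruction
    "Walk past the kitchen island, turn left into the hallway, "
    "and stop in front of the bathroom door."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "You are a navigation agent following this instruction: "
                        f"{instruction}\n"
                        "Given the panoramic view below, which step of the "
                        "instruction have you reached? Answer briefly."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{panorama_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

In an actual evaluation one would compare such responses against ground-truth localization labels over the full dataset; the snippet above only illustrates the single-example query pattern the abstract implies.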
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: situational awareness, vision-language model, vision-language navigation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English, Hindi, Telugu
Submission Number: 8126