SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

Published: 07 May 2025 · Last Modified: 07 May 2025 · ICRA Workshop on Human-Centered Robot Learning · CC BY 4.0
Workshop Statement: Our work presents a benchmark for VLM use in a human-centered task: social robot navigation. Specifically, we evaluate the scene understanding capabilities of VLMs and report preliminary experiments on state-of-the-art models. Our quantitative benchmark clarifies the current capabilities of VLMs and can help facilitate robot learning for navigation in human-centered environments.
Keywords: social robot navigation, scene understanding, VLM, benchmark, robotics, VQA
TL;DR: A VLM benchmark for scene understanding of social robot navigation scenarios.
Abstract: Robot navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding, encompassing spatiotemporal awareness as well as the ability to interpret human intentions. Recent Vision-Language Models (VLMs) demonstrate object recognition, common-sense reasoning, and contextual understanding, capabilities that make them promising candidates for addressing the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can reliably perform the complex spatiotemporal reasoning and intention inference needed for safe and socially compliant robot navigation. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. The benchmark provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still lags behind both a simpler rule-based approach and human performance, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework for exploring how VLMs can be tailored to meet real-world social robot navigation needs.
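The agreement-based scoring mentioned in the abstract lends itself to a short illustration. The sketch below is not from the paper (all function and variable names are hypothetical); it shows one plausible way to compute a VLM's probability of agreeing with human answers on a multiple-choice VQA benchmark: for each question, take the fraction of human annotators whose chosen option matches the VLM's answer, then average over questions.

```python
# Minimal sketch (not the authors' code) of an agreement-probability metric
# for multiple-choice VQA. All names here are illustrative assumptions.
from collections import Counter
from typing import Dict, List

def agreement_probability(
    vlm_answers: Dict[str, str],          # question_id -> VLM's chosen option
    human_answers: Dict[str, List[str]],  # question_id -> each annotator's option
) -> float:
    """Average, over questions, of the fraction of human annotators
    whose answer matches the VLM's answer."""
    scores = []
    for qid, vlm_choice in vlm_answers.items():
        annotators = human_answers.get(qid, [])
        if not annotators:
            continue  # skip questions without human labels
        counts = Counter(annotators)
        scores.append(counts[vlm_choice] / len(annotators))
    return sum(scores) / len(scores) if scores else 0.0

# Example: agrees with 2 of 3 annotators on q1 and 3 of 3 on q2 -> ~0.833
print(agreement_probability(
    {"q1": "B", "q2": "A"},
    {"q1": ["B", "B", "C"], "q2": ["A", "A", "A"]},
))
```

Under this reading, human baseline performance would be computed the same way, with each annotator scored against the remaining annotators; the paper may of course define the metric differently.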
Submission Number: 24