Keywords: Spatial reasoning, visual entailment, embodied AI, crowd navigation, multi-step reasoning, spatial relationships, vision-language models, 3D scene understanding, embodied cognition, spatial planning
Abstract: In recent years, many benchmarks have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test visual entailment (determining whether an image entails its accompanying text). Furthermore, existing visual entailment datasets use simple images, which precludes a rigorous evaluation of visual understanding. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5,608 pairs of images and synthetically generated true/false statements. Using images from the CrowdHuman dataset \cite{shao2018crowdhuman}, COREVQA provokes visual entailment reasoning over challenging, crowded scenes. Our results show that even top-performing VLMs achieve accuracy below 80\%, with other models performing substantially worse (39.98\%-69.95\%). This performance gap reveals key limitations in the ability of VLMs to semantically understand crowd-based images and to reason over each image-text pair. The benchmark's emphasis on spatial relationships and multi-step reasoning provides insight into the challenges faced by embodied AI systems navigating complex, crowded environments.
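A minimal sketch of how such a true/false visual-entailment benchmark might be scored, assuming each example is stored as a JSONL record with an image path, a statement, and a "true"/"false" label; the field names, file name `corevqa.jsonl`, and the `query_vlm` function are hypothetical placeholders, not the authors' released harness.

```python
import json

def query_vlm(image_path: str, statement: str) -> str:
    """Hypothetical stand-in for a VLM call that decides whether the image
    entails the statement. Here it is a trivial always-'true' baseline so the
    sketch runs end to end; a real evaluation would call the model under test."""
    return "true"

def evaluate(jsonl_path: str) -> float:
    """Compute accuracy over image-statement pairs, the metric reported in the abstract."""
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            prediction = query_vlm(record["image"], record["statement"]).strip().lower()
            correct += int(prediction == record["label"].strip().lower())
            total += 1
    return correct / total

if __name__ == "__main__":
    # Assumes a local file in the hypothetical format described above.
    print(f"Accuracy: {evaluate('corevqa.jsonl'):.2%}")
```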
Submission Type: Dataset/Benchmark Paper (< 9 Pages)
Submission Number: 77