Keywords: Spatial reasoning, visual entailment, embodied AI, crowd navigation, multi-step reasoning, spatial relationships, vision-language models, 3D scene understanding, embodied cognition, spatial planning
Abstract: In recent years, many benchmarks have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test visual entailment (determining whether an image entails its accompanying text). Furthermore, existing visual entailment datasets use simple images, which precludes a rigorous evaluation of visual understanding. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5,608 pairs of images and synthetically generated true/false statements. Using images from the CrowdHuman dataset \cite{shao2018crowdhuman}, COREVQA provokes visual entailment reasoning over challenging, crowded scenes. Our results show that even top-performing VLMs achieve accuracy below 80\%, with other models performing substantially worse (39.98\%-69.95\%). This performance gap reveals key limitations in the ability of VLMs to semantically understand crowd-based images and to reason over each image-text pair. The benchmark's emphasis on spatial relationships and multi-step reasoning provides insight into the challenges faced by embodied AI systems navigating complex, crowded environments.
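A minimal sketch of how such a true/false visual-entailment benchmark might be scored, assuming each example is stored as a JSONL record with an image path, a statement, and a "true"/"false" label; the field names, file name `corevqa.jsonl`, and the `query_vlm` function are hypothetical placeholders, not the authors' released harness.

```python
import json

def query_vlm(image_path: str, statement: str) -> str:
    """Hypothetical stand-in for a VLM call that decides whether the image
    entails the statement. Here it is a trivial always-'true' baseline so the
    sketch runs end to end; a real evaluation would call the model under test."""
    return "true"

def evaluate(jsonl_path: str) -> float:
    """Compute accuracy over image-statement pairs, the metric reported in the abstract."""
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            prediction = query_vlm(record["image"], record["statement"]).strip().lower()
            correct += int(prediction == record["label"].strip().lower())
            total += 1
    return correct / total

if __name__ == "__main__":
    # Assumes a local file in the hypothetical format described above.
    print(f"Accuracy: {evaluate('corevqa.jsonl'):.2%}")
```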
Submission Type: Dataset/Benchmark Paper (< 9 Pages)
Submission Number: 77