Challenges in Visual Entailment for Accessibility

Published: 28 Aug 2025 · Last Modified: 28 Aug 2025 · CV4A11y · CC BY 4.0
Keywords: Visual Question Answering, Multimodal, Visual Entailment, VLMs
TL;DR: We propose COREVQA, a novel Visual Question Answering benchmark that pairs crowded-scene images with synthetically generated true/false statements.
Abstract: In recent years, many benchmarks have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test a model's ability to perform visual entailment, that is, to accept or refute a hypothesis based on an image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5,608 pairs of images and synthetically generated true/false statements. Using images from the CrowdHuman dataset, COREVQA elicits visual entailment reasoning over challenging, crowded scenes. Our results show that even top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in the ability of VLMs to reason over certain types of image–statement pairs in crowded scenes.
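The evaluation described in the abstract is a binary visual entailment task: given an image and a statement, the model answers true or false, and accuracy is the fraction of correct judgments. Below is a minimal sketch of such an evaluation loop; the file layout, JSON field names, and the `query_vlm` helper are illustrative assumptions, not the paper's actual code or data format.

```python
import json


def query_vlm(image_path: str, statement: str) -> str:
    """Hypothetical stand-in for a VLM call; replace with a real model API."""
    raise NotImplementedError


def evaluate(examples_file: str) -> float:
    """Compute true/false accuracy over image-statement pairs."""
    with open(examples_file) as f:
        # Assumed format: [{"image": "...", "statement": "...", "label": "true" | "false"}, ...]
        examples = json.load(f)

    correct = 0
    for ex in examples:
        # Prompt the model to accept or refute the statement given the image.
        prediction = query_vlm(ex["image"], f'True or false: {ex["statement"]}')
        if prediction.strip().lower() == ex["label"].lower():
            correct += 1

    return correct / len(examples)
```

Exact-match scoring on "true"/"false" keeps the metric simple and comparable across models; in practice, answer normalization (stripping punctuation, mapping "yes"/"no" variants) may be needed depending on each model's output style.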
Submission Number: 19