Challenges in Visual Entailment for Accessibility

Published: 28 Aug 2025 · Last Modified: 28 Aug 2025 · CV4A11y · CC BY 4.0
Keywords: Visual Question Answering, Multimodal, Visual Entailment, VLMs
TL;DR: We propose COREVQA, a novel Visual Question Answering benchmark that pairs crowded-scene images with synthetically generated true/false statements.
Abstract: In recent years, many benchmarks have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test a model's ability to perform visual entailment, that is, to accept or refute a hypothesis based on an image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5,608 pairs of images and synthetically generated true/false statements. Using images from the CrowdHuman dataset, COREVQA elicits visual entailment reasoning over challenging, crowded scenes. Our results show that even top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in the ability of VLMs to reason over certain types of image–statement pairs in crowded scenes.
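The evaluation described in the abstract is a binary visual entailment task: given an image and a statement, the model answers true or false, and accuracy is the fraction of correct judgments. Below is a minimal sketch of such an evaluation loop; the file layout, JSON field names, and the `query_vlm` helper are illustrative assumptions, not the paper's actual code or data format.

```python
import json


def query_vlm(image_path: str, statement: str) -> str:
    """Hypothetical stand-in for a VLM call; replace with a real model API."""
    raise NotImplementedError


def evaluate(examples_file: str) -> float:
    """Compute true/false accuracy over image-statement pairs."""
    with open(examples_file) as f:
        # Assumed format: [{"image": "...", "statement": "...", "label": "true" | "false"}, ...]
        examples = json.load(f)

    correct = 0
    for ex in examples:
        # Prompt the model to accept or refute the statement given the image.
        prediction = query_vlm(ex["image"], f'True or false: {ex["statement"]}')
        if prediction.strip().lower() == ex["label"].lower():
            correct += 1

    return correct / len(examples)
```

Exact-match scoring on "true"/"false" keeps the metric simple and comparable across models; in practice, answer normalization (stripping punctuation, mapping "yes"/"no" variants) may be needed depending on each model's output style.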
Submission Number: 19