CounterBench: A Controllable Counterfactual Testbed Reveals Systematic Reasoning Failures in Vision-Language Models
Track: Track 2: Papers without Workshop Proceedings
Keywords: vision-language models, counterfactual reasoning, benchmark, compositional reasoning, VLM evaluation, cognitive evaluation, counterfactual consistency, synthetic testbed, causal reasoning, spatial reasoning
TL;DR: A zero-cost synthetic benchmark with 550 paired images reveals that even strong VLMs like GPT-4o and Qwen2.5-VL fail systematically on counterfactual consistency, scoring ≤66% on causal reasoning despite near-perfect spatial understanding.
Abstract: Vision-language models (VLMs) achieve impressive accuracy on standard visual question answering benchmarks, yet it remains unclear whether they reason about scenes or merely pattern-match from surface cues. We introduce CounterBench, a fully synthetic, controllable benchmark for evaluating counterfactual consistency in VLMs. For each of 550 test items, we programmatically generate a paired scene: an original and an intervened variant where exactly one semantic property is changed (an object's position, color, count, containment, or a causal link). We pose the same question to both images and measure whether the model's answer changes if and only if the intervention is relevant---the Counterfactual Consistency Score (CCS). Evaluating four state-of-the-art VLMs (GPT-4o, Gemini 2.0 Flash, Qwen2.5-VL-72B, GPT-4o mini), we find that while all models achieve 85--91% single-image accuracy, CCS drops to 80--90%, with dramatic category-specific failures: all models score $\leq$66% on causal arrow-following tasks, GPT-4o mini achieves only 48% CCS on counting, and even the strongest model (Qwen2.5-VL) reaches only 89.5% overall CCS. Crucially, spatial and containment reasoning are near-perfect (97--100%), revealing that failures are selective, not uniform. Our results demonstrate that a cheap, fully controllable testbed---generated in minutes with standard Python libraries---can surface systematic VLM failures invisible to standard benchmarks. We release the full benchmark, generation code, and evaluation pipeline.
Submission Number: 10
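The abstract defines the Counterfactual Consistency Score (CCS) only in words: a model is consistent on a paired item when its answer changes if and only if the intervention is relevant to the question. A minimal Python sketch of that rule follows. The names `PairedItem`, `normalize`, `is_consistent`, and `counterfactual_consistency_score` are hypothetical and are not taken from the released code. The sketch also assumes that answers are compared after simple whitespace and case normalization, and that CCS is the plain fraction of consistent items. The authors' exact definition (for example, whether answers must also be correct) may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PairedItem:
    """One test item: model answers on the original and intervened image.

    Hypothetical container for illustration; not the benchmark's own schema.
    """
    answer_original: str
    answer_intervened: str
    intervention_relevant: bool  # does the edit affect the correct answer?


def normalize(answer: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return answer.strip().lower()


def is_consistent(item: PairedItem) -> bool:
    """True when the answer changed if and only if the edit was relevant."""
    changed = normalize(item.answer_original) != normalize(item.answer_intervened)
    return changed == item.intervention_relevant


def counterfactual_consistency_score(items: Iterable[PairedItem]) -> float:
    """Fraction of paired items on which the model is consistent."""
    items = list(items)
    if not items:
        raise ValueError("CCS is undefined for an empty item list")
    return sum(is_consistent(item) for item in items) / len(items)


if __name__ == "__main__":
    demo = [
        PairedItem("left", "right", True),   # relevant edit, answer changed
        PairedItem("3", "3", True),          # relevant edit, answer unchanged
        PairedItem("red", "Red ", False),    # irrelevant edit, same answer
        PairedItem("yes", "no", False),      # irrelevant edit, answer flipped
    ]
    print(f"CCS = {counterfactual_consistency_score(demo):.2f}")
```

Under this reading, a model can score well on single-image accuracy and still lose CCS points in two ways: by ignoring a relevant intervention, or by changing its answer after an irrelevant one. Per-category scores such as those reported for counting or causal arrow-following would be obtained by applying the same function to each category's subset of items.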