Visual serial processing deficits explain divergences in human and VLM reasoning

20 Sept 2025 (modified: 12 Feb 2026), ICLR 2026 Conference Desk Rejected Submission, CC BY 4.0
Keywords: Vision language models, visual reasoning, geometric reasoning, serial processing, cognitive science
Abstract: Why do Vision Language Models (VLMs), despite strong benchmark performance, often fail on surprisingly simple visual reasoning tasks? We hypothesize that this gap reflects a deficit in visually grounded serial processing. To test this hypothesis, we compared human and VLM performance across three domains that systematically vary serial processing load: geometric reasoning (via concept complexity), enumeration (via individuation demands), and mental rotation (via transformation difficulty). In each domain, decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing, whether composing concepts, enumerating items, or performing mental transformations, the VLM-human performance gap widens reliably. These findings support our hypothesis that limits in serial, visually grounded reasoning form a fundamental bottleneck distinguishing current VLMs from humans.
Primary Area: applications to neuroscience & cognitive science
Submission Number: 22815
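The core analysis described in the abstract, correlating VLM accuracy with human reaction time across task conditions, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which correlation statistic was used, so Pearson's r is assumed here, and the names `pearson_r`, `human_rt_s`, and `vlm_accuracy` as well as all numeric values are hypothetical placeholders, not data or code from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT results from the paper.
# One entry per task condition, ordered by increasing serial load.
human_rt_s = [0.8, 1.1, 1.5, 2.2, 3.0]          # mean human reaction time (s)
vlm_accuracy = [0.95, 0.90, 0.78, 0.60, 0.45]   # VLM accuracy per condition

r = pearson_r(human_rt_s, vlm_accuracy)
print(f"Pearson r between human RT and VLM accuracy: {r:.3f}")
```

A strongly negative r on real per-condition data would correspond to the pattern the abstract reports: conditions on which humans take longer are the conditions on which VLM accuracy drops.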