Not How You Think, It's What You See: Decoupling Perception from Reasoning

ICLR 2026 Conference Submission 12733 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Visual Reasoning, Vision-Language Models (VLMs), Evaluation Framework, Evaluation Methodology, Cognitive Paradigms, Perception-Reasoning Interface, Perception Bottleneck, Bongard Problems, Multi-Image Reasoning, Human-centered evaluation, Interpretability
TL;DR: We show that VLMs' reasoning is limited by perception, not logic. Our framework decouples these processes by reasoning over text descriptions, and an interactive loop lets the model "look again" at images, unlocking latent reasoning and improving performance.
Abstract: The ability of Vision-Language Models (VLMs) to reason depends on a complex interplay between visual perception and abstract cognition. While it is widely recognized that perception is a significant bottleneck, systematically diagnosing how it fails and developing methods to unlock latent reasoning capabilities remain key challenges. To address this, we introduce a cognitively inspired framework that decomposes VLM behavior through four distinct paradigms: 1) Direct Visual Rule Learning (holistic processing), 2) Deductive Rule Learning (explicit rule extraction), 3) Componential Analysis (CA), which decouples perception by reasoning over task-agnostic textual descriptions, and 4) Interactive Componential Analysis (ICA), which introduces a feedback loop for targeted visual probing. Our framework's emphasis on task-agnostic decomposition and cognitive parallels provides a unique lens for analysis compared to prior decoupling efforts. Applying this framework across an expanded suite of benchmarks, we conduct a comprehensive evaluation of both proprietary and open-source multi-image VLMs. Our results confirm that perception is a primary bottleneck and show that our CA and ICA paradigms yield substantial performance gains, unlocking the latent reasoning abilities of powerful LLMs. Crucially, ICA demonstrates that an interactive loop can resolve fine-grained visual ambiguities that static descriptions cannot, outperforming the non-interactive CA approach. Our work provides a robust diagnostic toolkit for the community and offers concrete architectural insights, demonstrating that interactive, decoupled systems are a promising path toward more general and capable visual intelligence.
Primary Area: causal reasoning
Submission Number: 12733