Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward

ACL ARR 2025 May Submission 6048 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet they continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight persistent limitations in current approaches and show that diagnostic LLM agents can substantially improve interpretability and reliability: our agent-based system achieves strong empirical gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
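
The agent-based architecture is described above only at a high level. As a minimal sketch of what such a loop could look like, the code below interleaves LLM reasoning with a lightweight visual tool and a diagnostic critic that gates each step; every name (visual_module, llm_reason, critique, ReasoningState, solve) is a hypothetical placeholder, not the authors' released implementation.

    # Illustrative sketch only: an agent loop that grounds each chain-of-thought step
    # in visual evidence and iteratively refines the reasoning chain. All names are
    # hypothetical stand-ins, not the authors' code.
    from dataclasses import dataclass, field

    @dataclass
    class ReasoningState:
        question: str
        chain: list[str] = field(default_factory=list)  # accepted reasoning steps

    def visual_module(image, query: str) -> str:
        # Stand-in for a lightweight perception tool (e.g. a detector or OCR pass).
        return f"evidence for '{query}'"

    def llm_reason(state: ReasoningState, evidence: str) -> str:
        # Stand-in for prompting the LLM for the next reasoning step.
        return f"step grounded in {evidence}"

    def critique(step: str) -> bool:
        # Stand-in for a diagnostic check that flags hallucinated or prior-driven steps.
        return "hallucination" not in step

    def solve(image, question: str, max_iters: int = 3) -> list[str]:
        state = ReasoningState(question)
        query = question
        for _ in range(max_iters):
            evidence = visual_module(image, query)   # ground the next step in pixels
            step = llm_reason(state, evidence)       # propose a reasoning step
            if critique(step):
                state.chain.append(step)             # keep only verified steps
            else:
                query = f"re-check: {query}"         # refine the query and try again
        return state.chain

The point of the sketch is the control flow, not the components: the critic inspects each proposed step before it enters the chain, and rejected steps trigger another round of visual querying rather than ending the trace.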
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multi-modal dialogue systems
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 6048