MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

17 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Visual Reasoning, Chain-of-Thought, Vision-Language Models, Benchmark Evaluation
TL;DR: We introduce MME-CC, a vision-grounded benchmark of 11 tasks spanning spatial, geometric, and knowledge reasoning; models remain weak in spatial/geometric reasoning, show recurring error patterns, and follow a three-stage CoT strategy.
Abstract: As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet existing multimodal benchmarks either overemphasize textual reasoning or fail to capture these behaviors systematically, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (\textbf{M}ulti-\textbf{M}odal \textbf{E}valuation benchmark of \textbf{C}ognitive \textbf{C}apacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information—spatial, geometric, and knowledge-based reasoning—and provides fine-grained analyses of MLLMs’ cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments on 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs.\ 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak ($\leq$30\%). We further identify common error patterns---including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions---and observe that Chain-of-Thought typically follows a three-stage process (extract $\rightarrow$ reason $\rightarrow$ verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
Primary Area: datasets and benchmarks
Submission Number: 9064