Keywords: Multimodal Large Language Model, Vision Language Model, Cognition, Evaluation
Abstract: Despite significant progress on popular multimodal benchmarks, state-of-the-art Multimodal Large Language Models (MLLMs) continue to struggle with basic visual reasoning tasks that are trivially solved by humans, such as recognizing abstract patterns or identifying spatial relationships.
Such deficiencies undermine their efficacy and robustness, rendering high-level downstream applications (e.g., embodied AI) infeasible.
To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from the Factor-Referenced Cognitive Test (FRCT), a well-established cognitive psychology assessment, covering four domains of human visual cognition: (1) Visualization and Spatial Processing, (2) Perception and Closure, (3) Memory, and (4) Reasoning.
Furthermore, we leverage parametric generation to automatically construct unlimited test cases with controllable difficulty for applicable subtests.
Using VisFactor, we evaluate 20 frontier MLLMs, including both proprietary models (e.g., GPT and Gemini) and open-source models (e.g., LLaMA-3.2 and Qwen2.5-VL).
The best-performing model achieves a score of only 25.19%, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure–ground discrimination—regardless of model size or prompting strategy.
These findings suggest that performance improvements on existing general benchmarks may be castles in the air rather than evidence of human-like visual cognition, challenging the assumption that large-scale pretraining naturally induces gestalt-like perceptual capabilities.
The dataset and evaluation toolkit will be made publicly available upon publication.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 88