Keywords: Multimodal Large Language Models, Spatial Reasoning
Abstract: Humans can imagine and manipulate visual images mentally, a capability known as \textit{spatial visualization}.
While many multi-modal benchmarks assess reasoning over visible visual information, the ability to infer unseen relationships through spatial visualization remains insufficiently evaluated as a spatial skill.
Moreover, existing benchmarks' reliance on publicly sourced problems from IQ tests or math competitions risks data contamination and compromises assessment reliability.
To this end, we introduce \textbf{\textit{SpatialViz-Bench}}, a comprehensive multi-modal benchmark for \textit{spatial visualization} comprising \emph{1,180} programmatically generated problems across \emph{12} tasks spanning \emph{4} sub-abilities; its scalable generation framework allows expansion to ensure fair and continuously reliable evaluation.
Our evaluation of \emph{27} Multi-modal Large Language Models (MLLMs) reveals wide performance variations, demonstrates the benchmark's strong discriminative power, and uncovers counter-intuitive findings: Chain-of-Thought (CoT) prompting paradoxically degrades accuracy on open-source models.
Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs still exhibit deficiencies in \textit{spatial visualization} tasks, thereby filling a significant gap in the field.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 3999