Keywords: Multimodal Conceptual Structure, Benchmark, Evaluation
Abstract: Do multimodal LLMs (MLLMs) understand concepts structurally like humans?
Inspired by cognitive principles, we formalize \emph{multimodal conceptual structure} (MCS) as the relational organization grounded in concept-attribute bindings, inter-concept relations, and hypothetical transformations across vision-language space.
We introduce \textsc{MCSBench}, a diagnostic benchmark of 661 questions across seven tasks spanning four cognitive levels (from perceptual grounding to metacognitive verification), paired with 518 structural-integrity probes.
We further propose the Structural Alignment Index (SAI), an integrity-aware metric that awards credit only when both the answer is correct and the underlying reasoning is structurally sound.
Evaluating ${\sim}80$ MLLMs, we find that performance degrades sharply with cognitive depth, and SAI exposes structural brittleness that accuracy alone conceals.
Notably, supplying golden evidence in context yields substantially smaller integrity gains than accuracy gains, on both the difficult subset and the full question set, suggesting that retrieval augmentation may inflate correctness without facilitating genuine structural reasoning.
Additional analyses, such as item-response modeling, scaling analysis, and reasoning-graph diagnosis, validate that \textsc{MCSBench} provides a reliable, fine-grained diagnostic lens into conceptual-structure failures that existing benchmarks overlook. We will release our dataset and artifacts upon acceptance.
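
As an illustration of how an integrity-aware metric differs from plain accuracy, the following is a minimal sketch under stated assumptions, not the paper's definition of SAI. The abstract says only that credit is awarded when the answer is correct and the reasoning is structurally sound. This sketch assumes that "structurally sound" means every structural-integrity probe attached to a question is passed. The names `Item`, `accuracy`, and `integrity_aware_score` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One benchmark question with its probe outcomes (hypothetical layout)."""
    answer_correct: bool
    # Pass/fail outcome of each structural-integrity probe tied to this question.
    probe_results: List[bool] = field(default_factory=list)


def accuracy(items: List[Item]) -> float:
    """Fraction of questions answered correctly, ignoring probes."""
    if not items:
        raise ValueError("items must be non-empty")
    return sum(it.answer_correct for it in items) / len(items)


def integrity_aware_score(items: List[Item]) -> float:
    """Fraction of questions that are correct AND pass every probe.

    Assumption: a question with no probes counts as structurally sound,
    because all([]) is True in Python.
    """
    if not items:
        raise ValueError("items must be non-empty")
    credited = sum(it.answer_correct and all(it.probe_results) for it in items)
    return credited / len(items)


if __name__ == "__main__":
    demo = [
        Item(True, [True, True]),    # correct and sound: credited
        Item(True, [True, False]),   # correct but one probe failed: no credit
        Item(False, [True]),         # wrong answer: no credit
    ]
    print(accuracy(demo), integrity_aware_score(demo))
```

Under this reading, the integrity-aware score can never exceed accuracy, and the gap between the two counts answers that are correct but not structurally supported.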
Submission Number: 31