Keywords: cognitive dimensions, behavior, MLLMs
Abstract: The emergence of multimodal large language models (MLLMs) has led to near-human performance across a range of multimodal cognitive and reasoning tasks, despite relying solely on next-token prediction objectives. A critical and underexplored question is whether MLLMs trained under this paradigm truly exhibit human-like visual conceptual representations and behaviors during multimodal reasoning. To investigate this, we evaluated MLLMs on the widely used Odd-One-Out (O1O) behavioral task and found that they predict human choices with only limited accuracy. To address this discrepancy, we propose a novel approach: rather than using raw human behavioral data directly, we first identified core cognitive dimensions and judgmental bases from human behavioral records in O1O experiments, and then fine-tuned Qwen2.5-VL in a data-driven manner guided by these extracted dimensions, markedly enhancing its behavioral consistency with humans. Intriguingly, we found that models aligned with human cognition in this way not only retain their generality on downstream tasks but can even improve on them. Furthermore, searchlight representational similarity analysis (RSA) and cortical projection analyses showed that the fine-tuned model elicits increased activation in brain regions associated with problem planning and decision-making, such as the prefrontal cortex. This finding offers a potential neuroscientific explanation for the observed improvements and human-like alignment.
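To make the O1O evaluation concrete, below is a minimal sketch of how predictive accuracy for human odd-one-out choices can be computed from a model's item embeddings. The names here (`predict_odd_one_out`, `o1o_accuracy`, and the embedding inputs) are illustrative assumptions, not identifiers from the paper; the standard convention applied is that the odd item is the one excluded from the most similar pair in a triplet.

```python
# Hedged sketch: scoring a model's odd-one-out (O1O) choices against humans.
# `emb_*` are assumed to be feature vectors for the three items in a triplet,
# e.g. visual representations extracted from an MLLM such as Qwen2.5-VL.
import numpy as np

def cosine(u, v):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict_odd_one_out(emb_a, emb_b, emb_c):
    """Predict the odd item: the one outside the most similar pair."""
    sims = {
        "c": cosine(emb_a, emb_b),  # if (a, b) are most alike, c is odd
        "b": cosine(emb_a, emb_c),  # if (a, c) are most alike, b is odd
        "a": cosine(emb_b, emb_c),  # if (b, c) are most alike, a is odd
    }
    return max(sims, key=sims.get)

def o1o_accuracy(triplet_embeddings, human_choices):
    """Fraction of triplets where the model matches the human choice."""
    hits = [
        predict_odd_one_out(*triplet) == choice
        for triplet, choice in zip(triplet_embeddings, human_choices)
    ]
    return float(np.mean(hits))
```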
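For the brain-alignment analysis, the core computation in representational similarity analysis (RSA) is correlating two representational dissimilarity matrices (RDMs); a searchlight simply repeats this over local cortical neighborhoods. The sketch below shows only that core step under assumed inputs: `model_acts` and `brain_acts` are hypothetical (items x features) activation matrices, not the paper's actual data or pipeline.

```python
# Hedged sketch: the RDM-correlation step at the heart of RSA.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(activations):
    """Condensed RDM: pairwise correlation distance between item patterns."""
    # `activations` is an (items x features) matrix.
    return pdist(activations, metric="correlation")

def rsa_score(model_acts, brain_acts):
    """Spearman correlation between the model's and the brain's RDMs."""
    rho, _ = spearmanr(rdm(model_acts), rdm(brain_acts))
    return rho
```

In a searchlight variant, `brain_acts` would be restricted to voxels within a small sphere around each cortical location, yielding a map of alignment scores rather than a single value.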
Primary Area: applications to neuroscience & cognitive science
Submission Number: 11687