Track: Extended Abstract Track
Keywords: multimodal language models, vision capabilities
TL;DR: We find that MLLMs underperform their own vision backbones on perception tasks, but their language grounding boosts visual reasoning.
Abstract: Multimodal language models (MLLMs) have recently emerged as versatile models that unify visual perception with language understanding. However, their performance across core vision tasks remains poorly characterized relative to the traditional vision backbones on which they are built. In this work, we provide a systematic comparison of MLLMs and their underlying vision backbones across a diverse set of benchmarks. Our analysis reveals a consistent gap:
*MLLMs underperform their own vision backbones on perception tasks such as object recognition, with accuracy deficits of 10-15\%.* On the other hand, MLLMs demonstrate considerable gains on reasoning-heavy tasks, such as counting and relational understanding, where language grounding provides complementary benefits. One reason for this discrepancy lies in the limitations of current evaluation practices. Unlike Vision-Language Models (VLMs), MLLMs are evaluated through open-ended text generation, making results more sensitive to formatting errors and instruction-following failures than to core visual competence. Finally, to encourage research into the vision capabilities of MLLMs, we provide a reduced set of evaluations that requires modest resources while maintaining diagnostic value.
Submission Number: 107