Abstract: Multimodal Large Language Models (MLLMs) have demonstrated promising reasoning capabilities across diverse domains, yet their visual perception skills remain a critical bottleneck. In this study, we first investigate the impact of visual perception errors on visual reasoning by analyzing MLLM performance on 150 visual reasoning questions. Our findings reveal that incorrect answers often stem from failures in visual perception, while some correct answers arise from hallucinated visual details. Motivated by these insights, we introduce Do You See Me, a multidimensional, programmatically generated, and scalable benchmark inspired by human psychology that systematically assesses visual perception in MLLMs. The benchmark comprises seven perception-focused subtasks, each designed with control parameters that modulate task complexity, and it can be easily extended with new perception tasks and complexity levels. We evaluate multiple state-of-the-art closed-source and open-source MLLMs and conduct a human study to establish performance baselines. Results indicate that MLLMs perform poorly on visual perception tasks, achieving less than 50\% accuracy on most subtasks. Furthermore, as task complexity increases, MLLM performance declines drastically while human performance remains stable. A direct comparison between human-rated difficulty and MLLM performance highlights a widening performance gap on more challenging tasks. Our study underscores the urgent need to strengthen visual perception in MLLMs and bridge the gap with human-level perception across these dimensions.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodal LLMs, visual perception, benchmark dataset
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5452