Abstract: Multimodal Large Language Models (MLLMs) have demonstrated promising reasoning capabilities across diverse domains, yet their visual perception skills remain a critical bottleneck. In this study, we first investigate the impact of visual perception errors on visual reasoning by analyzing MLLM performance on 150 visual reasoning questions. Our findings reveal that incorrect answers often stem from failures in visual perception, while some correct answers arise from hallucinated visual details. Motivated by these insights, we introduce Do You See Me, a multidimensional, programmatically generated, and scalable benchmark inspired by human psychology that systematically assesses visual perception in MLLMs. The benchmark comprises seven perception-focused subtasks, each designed with control parameters that modulate task complexity, and it can be easily extended with new perception tasks and complexity levels. We evaluate multiple state-of-the-art closed-source and open-source MLLMs and conduct a human study to establish performance baselines. Results indicate that MLLMs perform poorly on visual perception tasks, achieving less than 50\% accuracy on most subtasks. Furthermore, as task complexity increases, MLLM performance declines drastically while human performance remains stable. A direct comparison between human-rated difficulty and MLLM performance highlights a widening performance gap on more challenging tasks. Our study underscores the urgent need to strengthen visual perception in MLLMs and bridge the gap with human-level perception across these dimensions.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodal LLMs, visual perception, benchmark dataset
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5452