Keywords: Visual Perception, Multimodal Large Language Models, Pattern Recognition, Robustness
Abstract: Recent Multimodal Large Language Models (MLLMs) demonstrate strong high-level visual reasoning on tasks such as visual question answering and image captioning. Yet existing benchmarks largely overlook their ability to capture fine-grained perceptual details. As MLLMs are increasingly deployed in safety- and reliability-critical settings, perceptual acuity becomes essential. We present HueManity, a scalable, automated benchmark for assessing fine-grained visual perception in MLLMs. HueManity comprises 83,850 Ishihara-style images embedding alphanumeric strings, designed to evaluate pattern recognition, a core aspect of visual understanding. Our evaluation of nine state-of-the-art MLLMs uncovers a striking performance deficit: the strongest model achieved only 33.6% accuracy on a simple numeric task and 3% on a harder alphanumeric task, compared to near-ceiling performance from humans (99.38%, 93.25%) and a fine-tuned ResNet-50 (96.5%, 94.5%). These findings expose a critical weakness in MLLMs' perceptual grounding, one that remains obscured by conventional benchmarks emphasizing high-level semantics.
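To make the stimulus design concrete, below is a minimal, illustrative sketch of how an Ishihara-style image embedding a single character can be generated. It is not the authors' generation pipeline: the character choice, dot sizes, color palettes, and the assumption that a DejaVuSans TrueType font is available are all hypothetical, and the dot layout uses simple rejection sampling rather than any method described in the paper.

```python
"""Illustrative sketch: render one Ishihara-style dot image embedding a character.

Approach (assumed, not the paper's pipeline): rasterize the glyph to a binary
mask, scatter non-overlapping random dots over the canvas, and color each dot
from a "figure" or "background" palette depending on whether its center lies
inside the glyph mask.
"""
import random
from PIL import Image, ImageDraw, ImageFont

CANVAS = 512
FIGURE_COLORS = ["#d94f2a", "#e07b39", "#c2452d"]      # hypothetical palette
BACKGROUND_COLORS = ["#7a9e3b", "#94b54a", "#5f8a32"]  # hypothetical palette


def glyph_mask(char: str, size: int = CANVAS) -> Image.Image:
    """Rasterize a single character to a grayscale mask centered on the canvas."""
    mask = Image.new("L", (size, size), 0)
    draw = ImageDraw.Draw(mask)
    # Assumes DejaVuSans.ttf is discoverable by Pillow; substitute any TTF path.
    font = ImageFont.truetype("DejaVuSans.ttf", int(size * 0.8))
    bbox = draw.textbbox((0, 0), char, font=font)
    x = (size - (bbox[2] - bbox[0])) // 2 - bbox[0]
    y = (size - (bbox[3] - bbox[1])) // 2 - bbox[1]
    draw.text((x, y), char, fill=255, font=font)
    return mask


def ishihara_style_image(char: str, n_dots: int = 800) -> Image.Image:
    """Fill the canvas with random non-overlapping dots colored by the glyph mask."""
    mask = glyph_mask(char)
    img = Image.new("RGB", (CANVAS, CANVAS), "white")
    draw = ImageDraw.Draw(img)
    placed = []  # (x, y, r) of accepted dots, used to reject overlaps
    attempts = 0
    while len(placed) < n_dots and attempts < n_dots * 30:
        attempts += 1
        r = random.uniform(4, 9)
        x = random.uniform(r, CANVAS - r)
        y = random.uniform(r, CANVAS - r)
        if any((x - px) ** 2 + (y - py) ** 2 < (r + pr) ** 2 for px, py, pr in placed):
            continue  # overlaps an existing dot; resample
        inside = mask.getpixel((int(x), int(y))) > 128
        color = random.choice(FIGURE_COLORS if inside else BACKGROUND_COLORS)
        draw.ellipse((x - r, y - r, x + r, y + r), fill=color)
        placed.append((x, y, r))
    return img


if __name__ == "__main__":
    ishihara_style_image("7").save("ishihara_style_7.png")
```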
Primary Area: datasets and benchmarks
Submission Number: 20426