Submission Type: Non-archival
Keywords: Multimodal Large Language Models (MLLMs), Fine-Grained Visual Perception, HueManity, Benchmark, MLLM Evaluation, Ishihara-style Stimuli, Perceptual Understanding
TL;DR: HueManity, our new Ishihara-style benchmark, reveals a critical failure of MLLMs in fine-grained visual perception (3%-33.6% accuracy), a task that humans and a fine-tuned ResNet50 handle near-perfectly, highlighting a key gap in perceptual understanding.
Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara-test-style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved 33.6\% accuracy on the numeric 'easy' task and a striking 3\% on the alphanumeric 'hard' task. In contrast, human participants achieved near-perfect scores (100\% and 95.6\%)\footnote{Human evaluations utilized 100-image representative subsets, sampled from the model evaluation sets for each respective task.}, and a fine-tuned ResNet50 model reached accuracies of 96.5\% and 94.5\%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap. We will open-source the HueManity dataset and code to foster further research on improving the perceptual robustness of MLLMs.
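The abstract's ResNet50 baseline refers to a standard fine-tuned image classifier. The sketch below is not the authors' code; it is a minimal illustration, under assumed choices, of how such a baseline could be set up: the directory layout ("huemanity/train" with one subfolder per two-character label), the class-per-string formulation, and all hyperparameters are hypothetical.

```python
# Minimal sketch of a fine-tuned ResNet50 baseline for two-character
# string classification (illustrative assumptions throughout).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing for the pretrained backbone.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical layout: each subfolder name is a two-character label (e.g. "7Q").
train_set = datasets.ImageFolder("huemanity/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# Replace the ImageNet head with one output per two-character class.
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # illustrative epoch count
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```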
Submission Number: 14