HueManity: Probing Fine-Grained Visual Perception in MLLMs

Published: 10 Jun 2025, Last Modified: 14 Jul 2025, ICML 2025 World Models Workshop, CC BY 4.0
Keywords: Multimodal Large Language Models (MLLMs), Fine-Grained Visual Perception, HueManity, Benchmark, MLLM Evaluation, Ishihara-style Stimuli, Perceptual Understanding
TL;DR: HueManity, our new Ishihara-style benchmark, reveals a critical failure of MLLMs in fine-grained visual perception (3%-33.6% accuracy), a task that humans and a fine-tuned ResNet50 handle near-perfectly, highlighting a key gap in perceptual understanding.
Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong high-level visual reasoning, yet their foundational understanding of nuanced perceptual details is often overlooked by existing evaluations. To address this, we introduce HueManity, a novel benchmark specifically designed to assess this crucial dimension of MLLM visual understanding. HueManity comprises 83,850 Ishihara-style images with embedded alphanumeric strings, challenging models on precise pattern recognition, a fundamental aspect of visual understanding. Our evaluation of nine MLLMs reveals a profound performance deficit: the best-performing model achieved only 33.6% accuracy on an 'easy' numeric task and 3% on a 'hard' alphanumeric task. This contrasts starkly with human performance (100% numeric, 95.6% alphanumeric) and a fine-tuned ResNet50 (96.5% numeric, 94.5% alphanumeric). These findings uncover a critical gap in MLLMs' fine-grained visual understanding, a limitation not apparent through conventional high-level assessments. HueManity offers a new paradigm for evaluating this dimension of model understanding. We will open-source the dataset and code to foster research towards robust perception in MLLMs.
Submission Number: 37