Submission Type: Non-archival
Keywords: Multimodal Large Language Models (MLLMs), Fine-Grained Visual Perception, HueManity, Benchmark, MLLM Evaluation, Ishihara-style Stimuli, Perceptual Understanding
TL;DR: HueManity, our new Ishihara-style benchmark, reveals a critical failure of MLLMs in fine-grained visual perception (3%-33.6% accuracy), a task that humans and a fine-tuned ResNet50 handle near-perfectly, highlighting a key gap in perceptual understanding.
Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara-test-style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved 33.6\% accuracy on the numeric 'easy' task and a striking 3\% on the alphanumeric 'hard' task. In contrast, human participants achieved near-perfect scores (100\% and 95.6\%)\footnote{Human evaluations utilized 100-image representative subsets, sampled from the model evaluation sets for each respective task.}, and a fine-tuned ResNet50 model reached accuracies of 96.5\% and 94.5\%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap. We will open-source the HueManity dataset and code to foster further research on improving the perceptual robustness of MLLMs.
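The abstract's ResNet50 baseline refers to a standard fine-tuned image classifier. The sketch below is not the authors' code; it is a minimal illustration, under assumed choices, of how such a baseline could be set up: the directory layout ("huemanity/train" with one subfolder per two-character label), the class-per-string formulation, and all hyperparameters are hypothetical.

```python
# Minimal sketch of a fine-tuned ResNet50 baseline for two-character
# string classification (illustrative assumptions throughout).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing for the pretrained backbone.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical layout: each subfolder name is a two-character label (e.g. "7Q").
train_set = datasets.ImageFolder("huemanity/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# Replace the ImageNet head with one output per two-character class.
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # illustrative epoch count
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```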
Submission Number: 14