Exploring Perceptual Limitations of Multimodal LLMs on Small Visual Objects

TMLR Paper6184 Authors

12 Oct 2025 (modified: 30 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable performance on various multimodal benchmarks. However, general benchmarks rarely reveal the specific limits of visual perception because they offer little experimental control. In this work, we quantitatively study the perception of small visual objects in several widely used MLLMs and reveal a pervasive limitation in answering questions about small objects in images. We then conduct a controlled study of MLLMs' perception, using text reading as a surrogate for general perceptual ability, to understand how object quality, size, distractors, and location each independently affect the perception of small objects. Through this controlled study, we find that lower object quality, smaller object size, and the presence of visual distractors can each independently reduce MLLMs' ability to answer visual questions. More surprisingly, even locally perturbing an object's position by a few pixels can cause a drastic decline in an MLLM's ability to perceive it. Our study provides a better understanding of the perceptual limitations of MLLMs and contributes new evaluation protocols for analyzing and enhancing the perception of future MLLMs.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Changyou_Chen1
Submission Number: 6184
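
The paper's code is not shown on this page, but the controlled protocol the abstract describes (rendering text at a chosen size and location, optionally degrading quality and adding distractors, then asking the model to read it) is straightforward to sketch. Below is a minimal Python illustration using Pillow; it is not the authors' implementation, and `query_mllm` is a hypothetical placeholder for whatever model API is under evaluation.

```python
"""A minimal sketch (assumed, not the paper's code) of controlled stimulus
generation for probing small-object perception via text reading."""
import random

from PIL import Image, ImageDraw, ImageFilter, ImageFont


def make_stimulus(target="HELLO", size_px=12, position=(100, 100),
                  blur_radius=0.0, n_distractors=0, canvas=(448, 448)):
    """Render `target` at `position` with font height ~`size_px` on a white
    canvas; optionally scatter distractor words and blur the result to
    simulate lower object quality."""
    img = Image.new("RGB", canvas, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default(size=size_px)  # size arg needs Pillow >= 10.1
    draw.text(position, target, fill="black", font=font)

    # Distractor words at random locations (may overlap the target; a real
    # protocol would likely control for overlap).
    rng = random.Random(0)
    for _ in range(n_distractors):
        xy = (rng.randrange(canvas[0]), rng.randrange(canvas[1]))
        word = "".join(rng.choice("ABCDEFGH") for _ in range(5))
        draw.text(xy, word, fill="black", font=font)

    if blur_radius > 0:
        img = img.filter(ImageFilter.GaussianBlur(blur_radius))
    return img


# Sweep one factor at a time while holding the others fixed, e.g. size:
for s in (8, 12, 16, 24, 32):
    img = make_stimulus(size_px=s)
    img.save(f"stimulus_size_{s}.png")
    # answer = query_mllm(img, "What word is written in the image?")  # hypothetical API
```

Varying one factor per sweep (size, blur radius, distractor count, or position offset in pixels) is what makes each factor's effect independently measurable, in the spirit of the controlled study the abstract summarizes.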