Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable performance on various multimodal benchmarks. However, general benchmarks often do not reveal the specific limits of their visual perception due to a lack of controllability. In this work, we quantitatively study the perception of small visual objects in several widely-used MLLMs and reveal a pervasive limitation in answering questions about small objects in images. We then conduct a controlled study of MLLMs' perception, using text-reading as a surrogate task for general visual perception, to understand how the quality, size, distractors, and location of an object can independently affect the ability of MLLMs to perceive it in images. Through this controlled study, we find that lower object quality, smaller object size, and the presence of visual distractors can each independently reduce MLLMs' ability to answer visual questions. More surprisingly, even local perturbations of an object by a few pixels can cause a drastic decline in the ability of MLLMs to perceive it. Our study provides a better understanding of the perceptual limitations of MLLMs and contributes new evaluation protocols for analyzing and enhancing the perception of future MLLMs.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
### Inclusion of Latest Frontier Model
We added **Qwen-3-VL-8B** to our experiments. While the model exhibits stronger overall performance, the core perceptual limitations and biases identified in prior baselines persist even in this advanced architecture.
### Expanded Robustness Evaluation with Text Variations
We updated the main experiments to incorporate **diverse text fonts, text colors, and background colors**. For cutting-based experiments, a fixed font was used to avoid confounding effects. The key perceptual sensitivity trends remain stable, reinforcing the robustness and generality of our findings.
### Confidence Interval Reporting for Statistical Reliability
We now report **95% confidence intervals** for all experimental groups, visualized via vertical error bars in the updated figures. The intervals are generally narrow, confirming that phenomena such as patch-boundary sensitivity and distractor vulnerability reflect statistically reliable behaviors rather than variance artifacts.
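The per-group intervals described above can be sketched as follows. This is a minimal illustration of one common way to compute a 95% confidence interval for a group's accuracy (a normal approximation to the binomial proportion); the helper name and the example counts are our own, not taken from the paper.

```python
import math

def accuracy_ci95(correct: int, total: int) -> tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a normal-approximation 95% CI.

    The half-width is 1.96 * sqrt(p * (1 - p) / n), clipped to [0, 1].
    """
    p = correct / total
    half = 1.96 * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical group: 180 correct answers out of 250 questions.
acc, lo, hi = accuracy_ci95(180, 250)
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The `(lo, hi)` pair maps directly onto the vertical error bars in the updated figures; a narrow interval indicates the group's accuracy estimate is statistically stable.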
### Regression Analysis on Real-World Datasets
We added a regression analysis to examine whether **location provides additional explanatory power beyond object size**. Comparing size-only models with models including both size and location, ANOVA results show that **Object Location is a statistically significant predictor** for most models and datasets. Full results are included in **Appendix C**.
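The nested-model comparison described above can be sketched with `statsmodels`. This is a hedged illustration on synthetic data, not the paper's actual analysis: the column names (`size`, `x`, `y`, `correct`) and the data-generating process are assumptions of ours, chosen only to show how an F-test (via `anova_lm`) assesses whether location adds explanatory power beyond size.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for per-question records (hypothetical columns):
# relative object area and normalized object-center coordinates.
df = pd.DataFrame({
    "size": rng.uniform(0.01, 0.5, n),
    "x": rng.uniform(0.0, 1.0, n),
    "y": rng.uniform(0.0, 1.0, n),
})
# Assumed response: size drives correctness, location adds a smaller effect.
df["correct"] = 2.0 * df["size"] + 0.3 * df["x"] + rng.normal(0.0, 0.2, n)

# Nested OLS models: size only vs. size + location.
size_only = smf.ols("correct ~ size", data=df).fit()
size_loc = smf.ols("correct ~ size + x + y", data=df).fit()

# F-test comparing the nested models: a small Pr(>F) indicates that
# location is a statistically significant predictor beyond size.
table = anova_lm(size_only, size_loc)
print(table)
```

In the actual analysis, `correct` would come from the model's per-question outcomes on the real-world datasets, and the same comparison would be run per model and per dataset.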
Assigned Action Editor: ~Changyou_Chen1
Submission Number: 6184