MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Published: 22 Jan 2025, Last Modified: 11 Feb 2025 | ICLR 2025 Poster | License: CC BY 4.0
Keywords: Multimodal Large Language Models, Visual Details, Attention, Gradients, Bias, Perception, Localization
TL;DR: We study the perception limitation of Multimodal LLMs and propose automatic visual cropping as a scalable and training-free solution to mitigate this limitation.
Abstract: Multimodal Large Language Models (MLLMs) have recently achieved promising performance on visual question answering (VQA)---a fundamental task affecting various downstream applications and domains. Given MLLMs' potential integration into many critical VQA applications, it is important to understand the limits of their perception. In this work, we study whether MLLMs can perceive small details as well as large details in images. In particular, we observe that their accuracy in answering visual questions is very sensitive to the size of the visual subject of the question. We further show that this effect is causal by observing that human visual cropping can significantly mitigate this sensitivity. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then construct automatic visual cropping methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to help it better perceive the small visual subject of any question. We study our proposed methods on two MLLMs and seven visual question answering benchmarks, and show that they can significantly improve MLLMs' accuracy without requiring any training. Our findings suggest that MLLMs should be used with caution in detail-sensitive applications, and that visual cropping is a promising direction to improve their performance.
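To illustrate the kind of attention-guided visual cropping the abstract describes, here is a minimal sketch (not the authors' implementation): it assumes the MLLM exposes a patch-level attention map for the given question, locates the most-attended region, and crops the image around it before re-querying the model with the same question. The function name, the crop_frac parameter, and the assumption of an externally provided attention map are illustrative only.

import numpy as np
from PIL import Image

def attention_guided_crop(image, attn_map, crop_frac=0.5):
    """Crop the image around the region of highest attention.

    image:     PIL.Image holding the original input.
    attn_map:  2D numpy array (H_p x W_p) of attention weights over image
               patches, e.g. cross-attention from the MLLM averaged over
               heads and layers (assumed to be extracted separately).
    crop_frac: side length of the crop as a fraction of the image's side.
    """
    W, H = image.size
    hp, wp = attn_map.shape

    # Locate the patch with the highest attention weight.
    py, px = np.unravel_index(np.argmax(attn_map), attn_map.shape)

    # Map the patch index to pixel coordinates (patch center).
    cx = (px + 0.5) * W / wp
    cy = (py + 0.5) * H / hp

    # Build a crop window of the requested size, clamped to image bounds.
    cw, ch = crop_frac * W, crop_frac * H
    left = int(np.clip(cx - cw / 2, 0, W - cw))
    top = int(np.clip(cy - ch / 2, 0, H - ch))
    return image.crop((left, top, left + int(cw), top + int(ch)))

In use, the cropped image (on its own or alongside the original) would be passed back to the MLLM with the unchanged question, so the small visual subject occupies a larger share of the model's visual input; no additional training is involved.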
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4947