EvalRes: Evaluating VLMs' Sensitivity to Image Resolution and Relative Detail Size

ICLR 2026 Conference Submission 22648 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: vision language model, multimodal, evaluation, benchmark
TL;DR: A flexible framework for evaluating VLMs' sensitivity and robustness to resolution- and aspect-ratio-related image transformations
Abstract: Vision Language Models (VLMs) have achieved remarkable success across a wide range of Visual Question Answering (VQA) tasks. Yet, they still struggle with high-resolution visual inputs in which the regions carrying key information are relatively small, or scenes are highly detailed and cluttered. This limitation stems from architectural bottlenecks in current vision encoders, which often fail to preserve the fine-grained details necessary for precise reasoning. While several approaches have been proposed to address this issue, a systematic evaluation of a model's capacity to process high-resolution content and small-scale visual cues has been lacking. In this work, we introduce a versatile framework for extending existing benchmarks and propose two novel metrics designed to assess VLMs' scalability across varying image resolutions and aspect ratios. Unlike evaluation with existing benchmarks, which lack consistency in image properties and fail to isolate resolution and aspect ratio effects, our method enables controlled experimentation that disentangles resolution sensitivity from overall task performance. Our framework not only enables more robust and fair VLM evaluation, but also paves the way for future research into high-fidelity visual understanding. We evaluate several widely used VLMs with the proposed framework, revealing that even state-of-the-art models struggle with higher resolutions and non-standard aspect ratios, and that processing small details remains a major challenge.
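To illustrate the kind of controlled transformation such an evaluation relies on, the sketch below generates resolution and aspect-ratio variants of a benchmark image that a VLM could then be queried on. This is a minimal, hypothetical example, not the authors' released code: the function names, scale factors, ratio values, and padding strategy are illustrative assumptions.

```python
# Hypothetical sketch: produce controlled resolution / aspect-ratio variants of
# a benchmark image so a VLM can be evaluated on each variant and its answers
# compared against those for the original image.
from PIL import Image


def make_resolution_variants(image_path, scales=(0.25, 0.5, 1.0, 2.0)):
    """Return copies of the image rescaled by each factor, preserving aspect ratio."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    return {
        s: img.resize((max(1, int(w * s)), max(1, int(h * s))), Image.BICUBIC)
        for s in scales
    }


def make_aspect_ratio_variants(image_path, ratios=(1.0, 4 / 3, 16 / 9, 3.0)):
    """Pad the image onto canvases with different aspect ratios.

    Padding (rather than stretching) shrinks the *relative* size of the
    region of interest without distorting its content.
    """
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    variants = {}
    for r in ratios:
        if r >= w / h:
            canvas_w, canvas_h = max(w, int(h * r)), h   # widen the canvas
        else:
            canvas_w, canvas_h = w, int(w / r)           # heighten the canvas
        canvas = Image.new("RGB", (canvas_w, canvas_h), (128, 128, 128))
        canvas.paste(img, ((canvas_w - w) // 2, (canvas_h - h) // 2))
        variants[r] = canvas
    return variants
```

In a loop over a VQA benchmark, each variant would be paired with the original question and scored, so that accuracy can be plotted as a function of scale factor or aspect ratio rather than aggregated into a single number.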
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22648