Vision Language Models are blind

Published: 2024 · Last Modified: 20 May 2025 · ACCV (5) 2024 · CC BY-SA 4.0
Abstract: While large language models with vision capabilities (VLMs) are powering various image-text applications and scoring high on many vision-understanding benchmarks, they still struggle, surprisingly, with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. The best-performing model reaches 77.84% accuracy, still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information when geometric primitives overlap or are close together. Code and data are at: vlmsareblind.github.io.
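To illustrate why these tasks have unambiguous ground truth, here is a minimal sketch (not the authors' code; function names are illustrative) of how correct answers for tasks (a) and (b) can be computed geometrically before rendering the test images:

```python
import math

def circles_overlap(c1, r1, c2, r2):
    """Task (a): two circles overlap iff the distance between their
    centers is less than the sum of their radii (tangency counts as
    non-overlapping here, an illustrative convention)."""
    return math.dist(c1, c2) < r1 + r2

def segments_intersect(p1, p2, p3, p4):
    """Task (b) building block: check whether segment p1-p2 crosses
    segment p3-p4 using signed-area orientation tests."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1 = cross(p3, p4, p1)
    d2 = cross(p3, p4, p2)
    d3 = cross(p1, p2, p3)
    d4 = cross(p1, p2, p4)
    # Proper crossing: endpoints of each segment lie on opposite
    # sides of the other segment.
    return (d1 * d2 < 0) and (d3 * d4 < 0)

# Example ground-truth labels for two synthetic stimuli:
print(circles_overlap((0, 0), 1.0, (1.5, 0), 1.0))   # centers 1.5 apart, radii sum 2.0
print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))  # an "X" shape
```

Counting intersections between two multi-segment lines, as in the paper's task (b), would then reduce to applying `segments_intersect` to every pair of segments.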