Keywords: Ambiguity, large vision-language model
Abstract: Vision–language models (VLMs) have demonstrated remarkable capabilities in visual recognition and reasoning, in some cases even surpassing human performance on standard benchmarks. However, it remains largely unexplored whether, and to what extent, VLMs possess higher-order aspects of human perception, such as abstract interpretation and the capacity to manage cognitive ambiguity. In this paper, we introduce \textbf{AmbiBench}, a benchmark designed to systematically evaluate how VLMs perceive and reason about ambiguous images relative to human interpretations. AmbiBench comprises 2,238 ambiguous images spanning nine categories, including object-level, scene-level, and a newly introduced mixed-ambiguity class, paired with 2,687 carefully constructed visual question–answer pairs. Evaluation of 12 state-of-the-art VLMs reveals substantial limitations: in five categories, models achieve less than half of human accuracy, and on mixed-ambiguity images, most collapse to near-zero performance.
Our study shows that humans flexibly navigate multiple interpretations, shifting between global and local perspectives, whereas VLMs largely rely on dominant features and exhibit restricted perception and reasoning under ambiguity. Using bistable images, we further probe the existence of perceptual-switch heads, attention heads that may underlie the handling of cognitive ambiguity.
AmbiBench exposes critical gaps in current VLMs’ capacity to handle perceptual ambiguity and establishes a foundation for developing models with more human-aligned interpretive and reasoning abilities.
Primary Area: datasets and benchmarks
Submission Number: 5981