Keywords: Ambiguity, large vision-language model
Abstract: Vision–language models (VLMs) have demonstrated remarkable capabilities in visual recognition and reasoning, in some cases even surpassing human performance on standard benchmarks. However, it remains largely unexplored whether, and to what extent, VLMs possess higher-order aspects of human perception, such as abstract interpretation and the capacity to manage cognitive ambiguity. In this paper, we introduce \textbf{AmbiBench}, a benchmark designed to systematically evaluate how VLMs perceive and reason about ambiguous images relative to human interpretations. AmbiBench comprises 2,238 ambiguous images spanning nine categories, including object-level, scene-level, and a newly introduced mixed-ambiguity class, paired with 2,687 carefully constructed visual question–answer pairs. Evaluation of 12 state-of-the-art VLMs reveals substantial limitations: in five categories, models achieve less than half of human accuracy, and on mixed-ambiguity images, most collapse to near-zero performance.
Our study shows that humans flexibly navigate multiple interpretations, shifting between global and local perspectives, whereas VLMs largely rely on dominant features and exhibit restricted perception and reasoning under ambiguity. Using bistable images, we further probe the existence of perceptual-switch heads, attention heads that may underlie the handling of cognitive ambiguity.
AmbiBench exposes critical gaps in current VLMs’ capacity to handle perceptual ambiguity and establishes a foundation for developing models with more human-aligned interpretive and reasoning abilities.
Primary Area: datasets and benchmarks
Submission Number: 5981