Abstract: While Vision Language Models (VLMs) are trained to learn conceptual representations (generalized knowledge across many instances), they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm creates tension between two priors in the model. The first is a pragmatic prior that the textual and visual inputs are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. To understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that when answering queries about instances, VLMs are typically dominated by the semantic prior, which arises from the language modality. In contrast, conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated by incongruent images.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, vision question answering, cross-modal information extraction
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 6953