A Benchmark for Evaluating Structural Ambiguity Resolution in Vision & Language Models

ACL ARR 2025 July Submission1453 Authors

29 Jul 2025 (modified: 13 Aug 2025), ACL ARR 2025 July Submission, CC BY 4.0
Abstract: Structural ambiguity in natural language, where a single sentence admits multiple meanings depending on its syntactic hierarchy, is a crucial challenge for language understanding. Visual context offers a valuable source of additional information for resolving such ambiguity, making Vision & Language Models (VLMs) a promising solution. As a first step towards evaluating the ability of VLMs to capture such structural ambiguity, we construct a large-scale benchmark that covers a variety of ambiguity types and includes both classification and generation tasks. Quantitative results on recent models reveal clear limitations, and our analysis identifies persistent challenges in aligning visual and structural semantics, offering insights for future research.
Paper Type: Long
Research Area: Semantics: Lexical and Sentence-Level
Research Area Keywords: compositionality, semantic textual similarity, phrase/sentence embedding, word/phrase alignment
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 1453