Keywords: Vision-Language Models, Structural Ambiguity, Multimodal Alignment, Visual Grounding, Cross-modal Semantics, Benchmark Dataset, Syntactic Interpretation
Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding.
While visual scenes can provide useful cues for resolving such ambiguity, exploiting these cues requires Vision and Language Models (VLMs) to reliably align each possible interpretation with the corresponding visual scene.
We introduce $\textbf{Vi}$sion and $\textbf{L}$anguage $\textbf{Str}$uctural $\textbf{U}$nderstanding $\textbf{B}$enchmark (ViLStrUB), a benchmark designed to evaluate vision and language alignment under structural ambiguity, consisting of ambiguous captions, their disambiguated interpretations, and corresponding images across seven ambiguity categories.
Using classification-based evaluation settings, we assess a diverse set of contrastive and LLM-based generative VLMs and compare their performance.
Our results show that most models perform near chance level and exhibit large gaps from human performance, revealing persistent limitations in aligning structurally distinct interpretations with visual scenes.
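As an illustration of the classification-based evaluation setting described above, the sketch below shows how a contrastive VLM might be probed: given an image and the two disambiguated readings of an ambiguous caption, the model should assign the higher image-text score to the reading that matches the scene. This is only a minimal sketch assuming an off-the-shelf CLIP checkpoint; the file path and example captions are hypothetical placeholders, not items from ViLStrUB.

```python
# Minimal sketch (not the paper's evaluation code): classification-style probing
# of a contrastive VLM under structural ambiguity. The image path and the two
# candidate readings below are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example_scene.jpg")  # image depicting one of the readings
readings = [
    "The man saw the woman, and he was using a telescope.",    # interpretation A
    "The man saw the woman, and she was holding a telescope.",  # interpretation B
]

# Score the image against both disambiguated readings and pick the higher one.
inputs = processor(text=readings, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, 2)

predicted = int(logits_per_image.argmax(dim=-1))
print(f"Model prefers interpretation {'A' if predicted == 0 else 'B'}")
```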
Paper Type: Long
Research Area: Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other areas
Research Area Keywords: polysemy, semantic textual similarity, phrase/sentence embedding, corpus creation, benchmarking, evaluation, multimodality, image text matching, linguistic theories
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7709