Keywords: Vision-Language Models, Structural Ambiguity, Multimodal Alignment, Visual Grounding, Cross-modal Semantics, Benchmark Dataset, Syntactic Interpretation
Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding.
While visual scenes can provide useful cues for resolving such ambiguity, exploiting these cues requires Vision and Language Models (VLMs) to reliably align each possible interpretation with the corresponding visual scene.
We introduce $\textbf{Vi}$sion and $\textbf{L}$anguage $\textbf{Str}$uctural $\textbf{U}$nderstanding $\textbf{B}$enchmark (ViLStrUB), a benchmark designed to evaluate vision and language alignment under structural ambiguity, consisting of ambiguous captions, their disambiguated interpretations, and corresponding images across seven ambiguity categories.
Using classification-based evaluation settings, we assess a diverse set of contrastive and LLM-based generative VLMs and compare their performance.
Our results show that most models perform near chance level and exhibit large gaps from human performance, revealing persistent limitations in aligning structurally distinct interpretations with visual scenes.
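As an illustration of the classification-based evaluation setting described above, the sketch below shows how a contrastive VLM might be probed: given an image and the two disambiguated readings of an ambiguous caption, the model should assign the higher image-text score to the reading that matches the scene. This is only a minimal sketch assuming an off-the-shelf CLIP checkpoint; the file path and example captions are hypothetical placeholders, not items from ViLStrUB.

```python
# Minimal sketch (not the paper's evaluation code): classification-style probing
# of a contrastive VLM under structural ambiguity. The image path and the two
# candidate readings below are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example_scene.jpg")  # image depicting one of the readings
readings = [
    "The man saw the woman, and he was using a telescope.",    # interpretation A
    "The man saw the woman, and she was holding a telescope.",  # interpretation B
]

# Score the image against both disambiguated readings and pick the higher one.
inputs = processor(text=readings, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, 2)

predicted = int(logits_per_image.argmax(dim=-1))
print(f"Model prefers interpretation {'A' if predicted == 0 else 'B'}")
```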
Paper Type: Long
Research Area: Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other areas
Research Area Keywords: polysemy, semantic textual similarity, phrase/sentence embedding, corpus creation, benchmarking, evaluation, multimodality, image text matching, linguistic theories
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7709