TRANSBIND: Explainable Compositional Grounding in Large Vision–Language Models via Agentic Counterfactual Diagnostics
Keywords: large vision-language models, compositional grounding, counterfactual diagnostics, minimal pairs, hallucination, interpretability, explainability, attribution, multimodal reasoning, evaluation benchmark
Abstract: Large vision-language models (LVLMs) are widely used for visual question answering, instruction following, and grounded generation, yet their reliability is undermined by fluent statements that the image does not support, often involving incorrect attributes, relations, or quantities. We argue that many such errors are compositional grounding failures: the model detects plausible entities but fails to bind the right attributes and relations to the right visual evidence. We present TransBind, a framework that makes compositional grounding testable and explainable at the level of semantic components. Its core is an agentic diagnostic suite, DiagTrans, which generates and verifies counterfactual minimal pairs that intervene on a single semantic factor (e.g., color, spatial relation, negation) while holding all other factors fixed. Building on these contrasts, we introduce two operational metrics: SEC, which quantifies the selective edit consistency of model outputs under controlled interventions, and AFC, which measures whether post-hoc explanations (e.g., attribution maps) change in a factor-aligned manner. We also describe a lightweight structured binding module, compatible with modern LVLM backbones, that encourages explicit role–filler binding through object-centric slots and factorized cross-attention. Overall, TransBind connects multimodal grounding, interpretability/explainability, and evaluation resources, offering a path toward LVLMs that are not only more faithful but also more auditable.
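The selective-edit-consistency idea behind SEC can be illustrated with a minimal sketch: on each counterfactual minimal pair, the model's per-factor answers should flip on the single edited factor and stay fixed on every untouched factor. The function and data layout below (`sec_score`, `edited_factor`, the answer dicts) are illustrative assumptions, not the paper's actual definition or API.

```python
# Hypothetical sketch of an SEC-style score. Each minimal pair records
# the single edited semantic factor and the model's per-factor answers
# on the original and counterfactual inputs. Names are illustrative.

def sec_score(pairs):
    """Fraction of minimal pairs where the answer changes on the edited
    factor and remains unchanged on all other factors."""
    consistent = 0
    for pair in pairs:
        edited = pair["edited_factor"]
        orig = pair["answers_original"]
        counter = pair["answers_counterfactual"]
        changed_on_edit = orig[edited] != counter[edited]
        stable_elsewhere = all(
            orig[f] == counter[f] for f in orig if f != edited
        )
        if changed_on_edit and stable_elsewhere:
            consistent += 1
    return consistent / len(pairs) if pairs else 0.0

pairs = [
    {   # color was edited: answer flips on color, holds on relation -> consistent
        "edited_factor": "color",
        "answers_original": {"color": "red", "relation": "left-of"},
        "answers_counterfactual": {"color": "blue", "relation": "left-of"},
    },
    {   # relation was edited, but the color answer also flipped -> inconsistent
        "edited_factor": "relation",
        "answers_original": {"color": "red", "relation": "left-of"},
        "answers_counterfactual": {"color": "blue", "relation": "right-of"},
    },
]
print(sec_score(pairs))  # 0.5
```

A high score means the model's behavior tracks exactly the intervened factor; the second pair shows the failure mode SEC is meant to catch, where an unrelated answer drifts under the edit.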
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal pretraining, image text matching, vision question answering, cross-modal content generation, multimodality, spoken language grounding, speech and vision
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 10732