TRANSBIND: Explainable Compositional Grounding in Large Vision–Language Models via Agentic Counterfactual Diagnostics
Keywords: large vision-language models, compositional grounding, counterfactual diagnostics, minimal pairs, hallucination, interpretability, explainability, attribution, multimodal reasoning, evaluation benchmark
Abstract: Large vision-language models (LVLMs) are widely used for visual question answering, instruction following, and grounded generation, yet their reliability is undermined by fluent statements that the image does not support, often involving incorrect attributes, relations, or quantities. We argue that many such errors are compositional grounding failures: the model detects plausible entities but fails to bind the right attributes and relations to the right visual evidence. We present TransBind, a framework that makes compositional grounding testable and explainable at the level of semantic components. Its core is an agentic diagnostic suite, DiagTrans, which generates and verifies counterfactual minimal pairs that intervene on a single semantic factor (e.g., color, spatial relation, negation) while holding all other factors fixed. Building on these contrasts, we introduce two operational metrics: SEC, which quantifies the selective edit consistency of model outputs under controlled interventions, and AFC, which measures whether post-hoc explanations (e.g., attribution maps) change in a factor-aligned manner. We also describe a lightweight structured binding module, compatible with modern LVLM backbones, that encourages explicit role–filler binding through object-centric slots and factorized cross-attention. Overall, TransBind connects multimodal grounding, interpretability/explainability, and evaluation resources, offering a path toward LVLMs that are not only more faithful but also more auditable.
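The selective-edit-consistency idea behind SEC can be illustrated with a minimal sketch: on each counterfactual minimal pair, the model's per-factor answers should flip on the single edited factor and stay fixed on every untouched factor. The function and data layout below (`sec_score`, `edited_factor`, the answer dicts) are illustrative assumptions, not the paper's actual definition or API.

```python
# Hypothetical sketch of an SEC-style score. Each minimal pair records
# the single edited semantic factor and the model's per-factor answers
# on the original and counterfactual inputs. Names are illustrative.

def sec_score(pairs):
    """Fraction of minimal pairs where the answer changes on the edited
    factor and remains unchanged on all other factors."""
    consistent = 0
    for pair in pairs:
        edited = pair["edited_factor"]
        orig = pair["answers_original"]
        counter = pair["answers_counterfactual"]
        changed_on_edit = orig[edited] != counter[edited]
        stable_elsewhere = all(
            orig[f] == counter[f] for f in orig if f != edited
        )
        if changed_on_edit and stable_elsewhere:
            consistent += 1
    return consistent / len(pairs) if pairs else 0.0

pairs = [
    {   # color was edited: answer flips on color, holds on relation -> consistent
        "edited_factor": "color",
        "answers_original": {"color": "red", "relation": "left-of"},
        "answers_counterfactual": {"color": "blue", "relation": "left-of"},
    },
    {   # relation was edited, but the color answer also flipped -> inconsistent
        "edited_factor": "relation",
        "answers_original": {"color": "red", "relation": "left-of"},
        "answers_counterfactual": {"color": "blue", "relation": "right-of"},
    },
]
print(sec_score(pairs))  # 0.5
```

A high score means the model's behavior tracks exactly the intervened factor; the second pair shows the failure mode SEC is meant to catch, where an unrelated answer drifts under the edit.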
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal pretraining, image text matching, vision question answering, cross-modal content generation, multimodality, spoken language grounding, speech and vision
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 10732