Uncovering Grounding IDs: How External Cues Shape Multi-Modal Binding

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision–Language Models, Symbolic Binding, External Cues, Mechanistic Interpretability, Hallucination Reduction, Causal Interventions, Modality Gap
TL;DR: We uncover how simple external cues act in LVLMs and show that they induce Grounding IDs: symbolic identifiers that strengthen cross-modal binding, sharpen attention, and reduce hallucinations.
Abstract: Large vision–language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as consistent within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism that explains how external cues enhance multimodal binding and offer both interpretability and practical improvements.
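The abstract refers to two measurable quantities: within-partition alignment in embedding space and the modality gap between image and text representations. As a minimal illustrative sketch (not the paper's actual analysis code), both can be operationalized with cosine similarity and centroid distance over hypothetical object embeddings; the function names and the toy data below are assumptions for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def within_partition_alignment(embeddings, partition_ids):
    """Mean pairwise cosine similarity among embeddings sharing a partition.

    Higher values suggest the kind of within-partition consistency the
    abstract attributes to Grounding IDs (one simple operationalization).
    """
    sims = []
    for pid in set(partition_ids):
        group = [e for e, p in zip(embeddings, partition_ids) if p == pid]
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                sims.append(cosine(group[i], group[j]))
    return float(np.mean(sims)) if sims else 0.0

def modality_gap(image_embs, text_embs):
    """Distance between image and text embedding centroids, a common
    operationalization of the modality gap."""
    return float(np.linalg.norm(image_embs.mean(axis=0) - text_embs.mean(axis=0)))

# Toy example: two objects in partition "A", one in partition "B".
embs = [np.array([1.0, 0.0]), np.array([1.0, 0.1]), np.array([0.0, 1.0])]
alignment = within_partition_alignment(embs, ["A", "A", "B"])

img = np.array([[1.0, 0.0], [1.0, 0.0]])
txt = np.array([[0.0, 1.0], [0.0, 1.0]])
gap = modality_gap(img, txt)
```

On this toy data, a smaller `gap` after adding external cues would correspond to the abstract's claim that Grounding IDs reduce the modality gap, while a larger `alignment` would correspond to stronger within-partition consistency.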
Primary Area: interpretability and explainable AI
Submission Number: 14190