Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

Nicolas Buzeta; Felipe del Rio; Cristian Hinostroza; Denis Parra; Hans Lobel; Rodrigo Toro Icarte

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 | Mech Interp Workshop ICML 2026 via fasttrack | Poster | Everyone | CC BY 4.0

Keywords: Applications of interpretability

Other Keywords: Vision-Language Models, Mechanistic Interpretability, Generalization, Variable Binding, Shortcut Learning, Multimodal Learning, Out-Of-Distribution

TL;DR: Visual training enhances performance on text-only retrieval tasks by forcing models to replace brittle positional shortcuts with robust symbolic binding mechanisms.

Abstract: We document a phenomenon in which Vision-Language Models (VLMs) outperform their underlying LLMs on text-only tasks, particularly long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution (OOD), while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Using interchange interventions, attention knockouts, and linear probes, we causally identify the mechanism underlying this improvement: text-only training converges to positional binding, a shortcut that exploits token positions, whereas image-based training disrupts this shortcut, in part through the translation invariance inherent in visual inputs, forcing the model to adopt symbolic binding based on semantic content. We characterize the circuit-level implementations of each mechanism, identifying a binding signature (a marked surge in attribute decodability at entity positions) that distinguishes explicit from implicit binding and generalizes to large-scale pretrained models. Our results demonstrate how mechanistic interpretability tools can causally link cross-modal training to learned computational strategies, and suggest that visual supervision acts as a mechanism-level regularizer that promotes robust binding in language models.
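
To make the distinction between positional and symbolic binding concrete, the following is a minimal toy sketch, not the authors' task, model, or code. It assumes a context of alternating entity and attribute tokens. The token names (`box_A`, `apple`, `pad`) and all function names are hypothetical. The "positional" strategy memorizes absolute token indices from one training layout, while the "symbolic" strategy locates the queried entity by its content.

```python
from typing import Dict, List


def fit_positional_shortcut(tokens: List[str], entities: List[str]) -> Dict[str, int]:
    """Memorize the absolute index of each entity's attribute in one layout.

    This stands in for a shortcut that relies on token positions (an assumption
    made for illustration, not a description of the paper's circuit).
    """
    return {entity: tokens.index(entity) + 1 for entity in entities}


def positional_lookup(tokens: List[str], query: str, slots: Dict[str, int]) -> str:
    """Answer by reading whatever token sits at the memorized index."""
    return tokens[slots[query]]


def symbolic_lookup(tokens: List[str], query: str) -> str:
    """Answer by finding the queried entity and reading the token after it."""
    return tokens[tokens.index(query) + 1]


# Hypothetical in-distribution layout: entity, attribute, entity, attribute, ...
train = ["box_A", "apple", "box_B", "key", "box_C", "coin"]
entities = ["box_A", "box_B", "box_C"]
slots = fit_positional_shortcut(train, entities)

# Hypothetical OOD layout: the same pairs, shifted right by two padding tokens.
shifted = ["pad", "pad"] + train

in_dist = [positional_lookup(train, e, slots) for e in entities]
ood_positional = [positional_lookup(shifted, e, slots) for e in entities]
ood_symbolic = [symbolic_lookup(shifted, e) for e in entities]

print("in-distribution (positional):", in_dist)
print("shifted (positional):        ", ood_positional)
print("shifted (symbolic):          ", ood_symbolic)
```

In this toy setting the positional strategy is perfect on the layout it memorized and wrong on every query once the context is shifted, whereas the content-based lookup is unaffected by the shift. This mirrors, in a much simplified form, the in-distribution versus OOD gap the abstract describes.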

Submission Number: 588
