Keywords: Applications of interpretability
Other Keywords: Vision-Language Models, Mechanistic Interpretability, Generalization, Variable Binding, Shortcut Learning, Multimodal Learning, Out-Of-Distribution
TL;DR: Visual training enhances performance on text-only retrieval tasks by forcing models to replace brittle positional shortcuts with robust symbolic binding mechanisms.
Abstract: We document a phenomenon in which Vision-Language Models (VLMs) outperform their underlying LLMs on text-only tasks, particularly long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution (OOD), while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Using interchange interventions, attention knockouts, and linear probes, we causally identify the mechanism underlying this improvement: text-only training converges to positional binding, a shortcut that exploits token positions, whereas image-based training disrupts this shortcut, in part through the translation invariance inherent in visual inputs, forcing the model to adopt symbolic binding based on semantic content. We characterize the circuit-level implementations of each mechanism, identifying a binding signature (a marked surge in attribute decodability at entity positions) that distinguishes explicit from implicit binding and generalizes to large-scale pretrained models. Our results demonstrate how mechanistic interpretability tools can causally link cross-modal training to learned computational strategies, and suggest that visual supervision acts as a mechanism-level regularizer that promotes robust binding in language models.
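The contrast between positional and symbolic binding can be illustrated with a toy sketch. This is not the authors' task or model: the entity-attribute pairs, the function names `positional_lookup` and `symbolic_lookup`, and the distractor-shift OOD setup are all hypothetical, chosen only to show why a position-based shortcut can be perfect in distribution yet fail once positions shift.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (entity, attribute); a hypothetical toy format


def positional_lookup(context: List[Pair], learned_slot: int) -> str:
    """Shortcut: read the attribute at a fixed slot memorized during training."""
    return context[learned_slot][1]


def symbolic_lookup(context: List[Pair], query: str) -> str:
    """Content-based binding: return the attribute paired with the queried entity."""
    for entity, attribute in context:
        if entity == query:
            return attribute
    raise KeyError(query)


# In-distribution: the queried entity always sits in slot 0.
train_context = [("box", "red"), ("cup", "blue"), ("hat", "green")]
# Out-of-distribution: same pairs, shifted by one distractor pair at the front.
ood_context = [("pen", "black")] + train_context

query = "box"
in_dist = (positional_lookup(train_context, 0), symbolic_lookup(train_context, query))
ood = (positional_lookup(ood_context, 0), symbolic_lookup(ood_context, query))

print("in-distribution:", in_dist)
print("out-of-distribution:", ood)
```

In this toy setting both strategies agree in distribution, but only the content-based lookup still returns the attribute bound to the queried entity after the shift.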
Submission Number: 588