Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

Published: 02 Mar 2026, Last Modified: 14 Mar 2026
Venue: ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Vision-Language Models, Mechanistic Interpretability, Generalization, Variable Binding, Shortcut Learning, Multimodal Learning, Out-Of-Distribution
TL;DR: Visual training enhances performance on text-only retrieval tasks by forcing models to replace brittle positional shortcuts with robust symbolic binding mechanisms.
Abstract: Vision-Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely textual tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 53