Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models
Abstract: Recent studies have shown that deep vision-only and language-only models, trained on disjoint modalities, nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, which visual or linguistic cues support it, and whether it survives the many-to-many nature of real image–text relationships.
In this work, we systematically investigate these questions. First, we show that representational alignment emerges most strongly in the mid-to-late layers of both vision and language models, suggesting a hierarchical progression from modality-specific to conceptually shared representations. Second, this alignment is robust to appearance-only changes but collapses when semantic content is altered (e.g., object removal in images, or word-order shuffling that disrupts thematic roles in sentences), indicating that the shared code is genuinely semantic rather than form-based. Critically, we move beyond the conventional one-to-one image-caption paradigm to investigate alignment in many-to-many contexts, acknowledging that neither modality uniquely determines the other. Using a forced-choice "Pick-a-Pic" task, we find that human preferences for image-caption matches are mirrored in the learned embedding spaces across all vision-language model pairs. This pattern holds bidirectionally, including when multiple captions correspond to a single image, demonstrating that the models capture fine-grained semantic distinctions in line with human judgments.
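The abstract does not name the specific alignment metric used, so the following is only a minimal, hypothetical sketch of the kind of cross-modal comparison described above, using linear centered kernel alignment (CKA), a standard measure of representational similarity that works even when the vision and language embeddings have different dimensionalities. The array shapes, layer choices, and variable names are illustrative assumptions, not the paper's actual setup:

import numpy as np

def center(gram):
    # Double-center a Gram matrix: H @ gram @ H with H = I - 1/n.
    n = gram.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ gram @ h

def linear_cka(x, y):
    # Linear CKA between paired representations x (n, d_x) and y (n, d_y),
    # where row i of each matrix encodes the same concept. Returns a value
    # in [0, 1]; higher means more similar representational geometry.
    kx = center(x @ x.T)
    ky = center(y @ y.T)
    hsic = np.sum(kx * ky)
    return hsic / (np.linalg.norm(kx) * np.linalg.norm(ky))

# Toy usage with random stand-ins for real model activations.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(100, 768))   # hypothetical vision-model layer outputs
txt_emb = rng.normal(size=(100, 1024))  # hypothetical language-model layer outputs
print(linear_cka(img_emb, txt_emb))

In practice one would extract img_emb and txt_emb from matched image-caption pairs at a chosen layer of each unimodal model and sweep the layer index to trace where alignment peaks.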
Surprisingly, aggregating embeddings across multiple images or phrases that refer to the same concept amplifies alignment. Rather than "blurring" representational detail, aggregation appears to distill a more universal semantic core. Together, these results demonstrate that vision and language networks converge on a shared semantic code, one that mirrors human judgments and becomes more pronounced when multiple exemplars of the same concept are averaged within a single modality's representational space. Our work provides compelling evidence for a universal code of meaning that transcends modality, offering critical insights into how neural networks represent and align semantic information across the vision-language divide.
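The abstract likewise does not spell out the aggregation procedure; a plausible reading is simple mean-pooling of exemplar embeddings per concept before measuring alignment. The sketch below illustrates that reading with hypothetical data, shapes, and function names assumed for the example:

import numpy as np

def aggregate_by_concept(embeddings, concept_ids):
    # Average all exemplar embeddings that share a concept id.
    # embeddings: (n, d); concept_ids: length-n label array. Returns a
    # (k, d) array of per-concept means, ordered by sorted unique id so
    # the vision and language sides line up row by row.
    concepts = np.unique(concept_ids)
    return np.stack([embeddings[concept_ids == c].mean(axis=0) for c in concepts])

# Toy usage: 5 photos and 5 paraphrased captions per concept, 20 concepts.
rng = np.random.default_rng(0)
concept_ids = np.repeat(np.arange(20), 5)
img_emb = rng.normal(size=(100, 768))    # hypothetical vision embeddings
txt_emb = rng.normal(size=(100, 1024))   # hypothetical language embeddings
img_proto = aggregate_by_concept(img_emb, concept_ids)
txt_proto = aggregate_by_concept(txt_emb, concept_ids)
print(img_proto.shape, txt_proto.shape)  # (20, 768) (20, 1024)
# Alignment (e.g., the linear_cka sketch above) can then be compared on the
# per-exemplar embeddings versus these per-concept prototypes.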
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: sentence embedding, Image Text Matching, cross-modal information extraction, semantics
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 6402