Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models
Abstract: Recent studies have shown that deep vision-only and language-only models, trained on disjoint modalities, nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, which visual or linguistic cues support it, and whether it survives the many-to-many nature of real image–text relationships.
In this work, we systematically investigate these questions. First, we show that representational alignment emerges most strongly in the mid-to-late layers of both vision and language models, suggesting a hierarchical progression from modality-specific to conceptually shared representations. Second, this alignment is robust to appearance-only changes but collapses when semantic content is altered (e.g., object removal in images, or word-order shuffling that disrupts thematic roles in sentences), indicating that the shared code is genuinely semantic rather than form-based. Critically, we move beyond the conventional one-to-one image-caption paradigm to investigate alignment in many-to-many contexts, acknowledging that neither modality uniquely determines the other. Using a forced-choice "Pick-a-Pic" task, we find that human preferences for image-caption matches are mirrored in the learned embedding spaces across all vision-language model pairs. This pattern holds bidirectionally, including when multiple captions correspond to a single image, demonstrating that the models capture fine-grained semantic distinctions in line with human judgments.
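The abstract does not name the specific alignment metric used, so the following is only a minimal, hypothetical sketch of the kind of cross-modal comparison described above, using linear centered kernel alignment (CKA), a standard measure of representational similarity that works even when the vision and language embeddings have different dimensionalities. The array shapes, layer choices, and variable names are illustrative assumptions, not the paper's actual setup:

import numpy as np

def center(gram):
    # Double-center a Gram matrix: H @ gram @ H with H = I - 1/n.
    n = gram.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ gram @ h

def linear_cka(x, y):
    # Linear CKA between paired representations x (n, d_x) and y (n, d_y),
    # where row i of each matrix encodes the same concept. Returns a value
    # in [0, 1]; higher means more similar representational geometry.
    kx = center(x @ x.T)
    ky = center(y @ y.T)
    hsic = np.sum(kx * ky)
    return hsic / (np.linalg.norm(kx) * np.linalg.norm(ky))

# Toy usage with random stand-ins for real model activations.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(100, 768))   # hypothetical vision-model layer outputs
txt_emb = rng.normal(size=(100, 1024))  # hypothetical language-model layer outputs
print(linear_cka(img_emb, txt_emb))

In practice one would extract img_emb and txt_emb from matched image-caption pairs at a chosen layer of each unimodal model and sweep the layer index to trace where alignment peaks.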
Surprisingly, aggregating embeddings across multiple images or phrases that refer to the same concept amplifies alignment. Rather than "blurring" representational detail, aggregation appears to distill a more universal semantic core. Together, these results demonstrate that vision and language networks converge on a shared semantic code, one that mirrors human judgments and becomes more pronounced when multiple exemplars of the same concept are averaged within a single modality's representational space. Our work provides compelling evidence for a universal code of meaning that transcends modality, offering critical insights into how neural networks represent and align semantic information across the vision-language divide.
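The abstract likewise does not spell out the aggregation procedure; a plausible reading is simple mean-pooling of exemplar embeddings per concept before measuring alignment. The sketch below illustrates that reading with hypothetical data, shapes, and function names assumed for the example:

import numpy as np

def aggregate_by_concept(embeddings, concept_ids):
    # Average all exemplar embeddings that share a concept id.
    # embeddings: (n, d); concept_ids: length-n label array. Returns a
    # (k, d) array of per-concept means, ordered by sorted unique id so
    # the vision and language sides line up row by row.
    concepts = np.unique(concept_ids)
    return np.stack([embeddings[concept_ids == c].mean(axis=0) for c in concepts])

# Toy usage: 5 photos and 5 paraphrased captions per concept, 20 concepts.
rng = np.random.default_rng(0)
concept_ids = np.repeat(np.arange(20), 5)
img_emb = rng.normal(size=(100, 768))    # hypothetical vision embeddings
txt_emb = rng.normal(size=(100, 1024))   # hypothetical language embeddings
img_proto = aggregate_by_concept(img_emb, concept_ids)
txt_proto = aggregate_by_concept(txt_emb, concept_ids)
print(img_proto.shape, txt_proto.shape)  # (20, 768) (20, 1024)
# Alignment (e.g., the linear_cka sketch above) can then be compared on the
# per-exemplar embeddings versus these per-concept prototypes.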
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: sentence embedding, Image Text Matching, cross-modal information extraction, semantics
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 6402