Representational Space Alignment for Vision-Language Models
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: While recent vision–language models (VLMs) achieve impressive performance across diverse benchmarks, a substantial modality gap persists due to the distinct inductive biases of their visual and textual backbones in data, architecture, and training objectives. Prior efforts primarily enforce cross-modal alignment by training visual and textual representations of the same semantics (e.g., "a green apple" as an image or as a caption) to exhibit high angular similarity. However, such exact alignment can suppress modality-specific information and limit the flexibility and expressiveness of the learned representations.
In this work, we relax this exact-alignment constraint and instead focus on aligning the geometric structure of the vision and language latent spaces. Inspired by vector arithmetic phenomena in word embeddings and linear function vectors in large language models, we propose a straightforward but effective Representational Space Alignment (RSA) loss that encourages the relative geometry of the vision latent space to mirror that of the language latent space. Empirically, we show that (1) unimodal backbones in existing VLMs exhibit weak structural alignment, particularly across layers, with the strongest alignment occurring at their last layers; (2) VLMs trained with the RSA loss not only achieve better cross-modal alignment but also reach high alignment faster; and (3) VLMs trained with the RSA loss achieve consistent gains on fine-grained visual reasoning and perception benchmarks, including MME, MMBench, RealWorldQA, and OK-VQA. Moreover, RSA enhances data efficiency, enabling strong performance under limited training data. These results highlight representational structure alignment as a promising new direction for building more coherent and manipulable vision–language representations.
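The abstract does not spell out the RSA objective, but a minimal PyTorch sketch of one plausible instantiation follows, assuming "relative geometry" means batch-wise pairwise cosine-similarity structure; the function name rsa_loss and the MSE penalty are illustrative assumptions, not the paper's definition.

    import torch
    import torch.nn.functional as F

    def rsa_loss(vision_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # vision_emb: (B, Dv) batch of vision features
        # text_emb:   (B, Dt) batch of text features for the paired captions
        # Pairwise cosine-similarity matrices capture each modality's
        # intra-batch relative geometry.
        v = F.normalize(vision_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        sim_v = v @ v.T  # (B, B) relative geometry of the vision batch
        sim_t = t @ t.T  # (B, B) relative geometry of the language batch
        # Penalize mismatch between the two similarity structures, encouraging
        # the vision space's geometry to mirror the language space's.
        return F.mse_loss(sim_v, sim_t)

Note that this objective compares within-modality similarity matrices rather than directly pulling paired image and text embeddings together, which is consistent with the abstract's claim that RSA relaxes exact cross-modal alignment.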
Presenter: ~Tian_Yun2
Submission Number: 92