Track: Extended Abstract Track
Keywords: vision language models, generalization failures, representational alignment, multimodal
TL;DR: Vision-language models can struggle to generalize because their vision and language representations have different structures
Abstract: Vision-language models fail at some tasks that are simple for humans, but why? Many of these failures point to a mismatch between the models' vision and language representations. We test this by showcasing a new failure: a VLM can misclassify simple concepts, such as a zebra, if they are not explicitly used to align its vision and language representations. This occurs even though the vision and language models each recognize the concept on their own, and it persists for both retrieval and generative models. Thus, alignment is hard: the language part of the VLM clusters representations not by semantics but by other features, e.g. the first word in a sentence. We propose a method that matches images and captions without directly translating between their representations, and we demonstrate that it achieves good performance on a benchmark where VLMs struggle due to representational misalignment, beating models with two orders of magnitude more parameters.
Submission Number: 49