Track: Extended Abstract Track
Keywords: vision language models, generalization failures, representational alignment, multimodal
TL;DR: Vision-language models can struggle to generalize because their vision and language representations have different structures
Abstract: Vision-language models fail at some tasks that are simple for humans, but why? Many of these failures point to a mismatch between the models' vision and language representations. We test this by showcasing a new failure: a VLM can misclassify simple concepts, such as a zebra, if they are not explicitly used to align its vision and language representations. This occurs even though the vision and language models each recognize the concept on their own, and it persists for both retrieval and generative models. Thus, alignment is hard: the language part of the VLM clusters representations not by semantics but by other features, e.g. the first word in a sentence. We propose a method that matches images and captions without directly translating between their representations, and we demonstrate that it achieves good performance on a benchmark where VLMs struggle due to representational misalignment, beating models with two orders of magnitude more parameters.
Submission Number: 49