Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: Vision-language models exhibit many surprisingly simple failures, but why these failures occur remains unclear. We conjecture that their source is misalignment between the backbone's vision and language representations. We demonstrate a new generalization failure that would not occur if the representations were easily alignable, followed by a set of theory-grounded experiments further showing that the representations cannot be aligned using any linear transform. The representations are not expected to become better aligned with scale, because each modality contains inherently different information. Modern paradigms, such as reasoning or in-context learning, do not alleviate existing failures either. These results suggest that existing paradigms are incapable of preventing these failures and falsify a strong version of the Platonic Representation Hypothesis: that sufficiently powerful models trained in different modalities should converge to equivalent representations.
Presenter: ~Yonatan_Gideoni1
Submission Number: 37