Keywords: visual-language models, abstract visual reasoning, Bongard problems, representation alignment, contrastive learning
TL;DR: next token + contrastive objective in vlms helps resolve alignment gap between perception and reasoning
Abstract: A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive ``alignment gap'', where most models fail to generatively outperform the linear separability of their own representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual alignment issue. By augmenting standard next-token prediction with a contrastive objective, our method restructures the visual manifold into a more one-dimensionally linear geometry, improving image-to-image comparison and enabling models to significantly surpass the LSC on abstract binary classification tasks.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 2167
Loading