Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

Enrico Vompa; Tanel Tammet; Mohit Vaishnav

04 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: visual-language models, abstract visual reasoning, Bongard problems, representation alignment, contrastive learning

TL;DR: Combining next-token prediction with a contrastive objective in VLMs helps close the alignment gap between perception and reasoning.

Abstract: A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive "alignment gap": most models fail to generatively outperform the linear separability of their own representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format, or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual alignment issue. By augmenting standard next-token prediction with a contrastive objective, our method restructures the visual manifold into a more one-dimensionally linear geometry, improving image-to-image comparison and enabling models to significantly surpass the LSC on abstract binary classification tasks.
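
As a rough illustration of the LSC idea described in the abstract, the sketch below fits a linear probe on embeddings and compares its held-out accuracy with a generative accuracy. It is not the authors' implementation: the embeddings are synthetic Gaussian clusters standing in for VLM visual embeddings, the probe is a plain logistic regression trained by gradient descent, and the names `train_linear_probe`, `probe_accuracy`, `make_toy_embeddings`, and the `generative_accuracy` value are all hypothetical.

```python
import math
import random


def train_linear_probe(X, y, epochs=200, lr=0.1):
    """Fit a logistic-regression probe (weights and bias) by batch gradient descent."""
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() numerically safe
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Accuracy of the linear decision rule sign(w . x + b) on held-out data."""
    correct = 0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        correct += int((z > 0) == (yi == 1))
    return correct / len(X)


def make_toy_embeddings(n_per_class, dim, shift, rng):
    """Hypothetical stand-in for visual embeddings: two Gaussian clusters."""
    X, y = [], []
    for label in (0, 1):
        center = shift if label == 1 else -shift
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_toy_embeddings(100, 8, 1.0, rng)
X_test, y_test = make_toy_embeddings(100, 8, 1.0, rng)

w, b = train_linear_probe(X_train, y_train)
lsc = probe_accuracy(w, b, X_test, y_test)  # linear separability ceiling (toy data)

generative_accuracy = 0.70  # hypothetical placeholder, not a result from the paper
alignment_gap = lsc - generative_accuracy  # positive: generation falls below the ceiling
print(f"LSC={lsc:.3f}, gap={alignment_gap:.3f}")
```

In practice the toy clusters would be replaced by embeddings extracted from the model under study, and the placeholder accuracy by the model's measured generative performance on the same examples.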

Supplementary Material: zip

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 2167
