Aligning Visual Structural Compositionality in Humans & Vision-Language Models

Published: 02 Mar 2026, Last Modified: 06 Mar 2026 · ICLR 2026 Re-Align Workshop · CC BY 4.0
Track: tiny / short paper (up to 5 pages)
Domain: machine learning
Abstract: An open question across machine learning, neuroscience, and cognitive science is whether current foundation models, in particular vision-language models (VLMs), learn representations that reflect human-like compositional processing. While linguistic compositionality is well studied, the extent to which visual structural compositionality emerges in vision models remains under-explored. Here, we present a representational alignment probing framework that maps VLM embeddings to graph properties derived from human-annotated scene graphs in images and linguistic structures in text. Evaluating CLIP and several of its variants, we observe differences in alignment: while text encoders reliably reflect structural graph properties, vision encoders show limited alignment with visual relational structure. We then propose the GraphCLIP model architecture to more explicitly incorporate visual structural signal, but find no substantial performance improvements on our structural probing tasks.
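To make the probing setup concrete, below is a minimal sketch of one possible representational-alignment probe in the spirit of the abstract: a ridge regression predicting a scene-graph property (here, edge count) from frozen CLIP image embeddings. The data-loading side (`samples`, the image paths, and the annotated graphs) is hypothetical and not taken from the paper; only the general idea of mapping embeddings to graph properties follows the abstract.

```python
# Hedged sketch: linear probe from frozen CLIP image embeddings to a
# scene-graph property. All dataset names and helpers here are assumptions
# for illustration, not the authors' implementation.
import numpy as np
import networkx as nx
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def embed_image(path: str) -> np.ndarray:
    """Return the frozen CLIP vision-encoder embedding for one image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features.squeeze(0).numpy()

def graph_property(scene_graph: nx.Graph) -> float:
    """Structural target for the probe; edge count is one simple choice."""
    return float(scene_graph.number_of_edges())

def run_probe(samples):
    """samples: assumed list of (image_path, networkx scene graph) pairs
    built from human annotations (e.g. Visual Genome-style graphs)."""
    X = np.stack([embed_image(path) for path, _ in samples])
    y = np.array([graph_property(graph) for _, graph in samples])
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
    return scores.mean()  # higher R^2 = stronger structural alignment
```

An analogous probe over `model.get_text_features` and dependency-graph properties of the captions would give the text-encoder side of the comparison.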
Presenter: ~Helena_Balabin1
Submission Number: 69