CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: vision-language models, CLIP, compositionality, binding
TL;DR: CLIP binds concepts, but only uni-modally.
Abstract: CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships between them. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. Our key finding is that CLIP does not lack binding information. Through linear probing, robustness tests with increasing object counts, and conjunctive search experiments, we show that attribute–object bindings are already encoded within CLIP’s text and image embeddings. The weakness lies in the cross-modal alignment, which fails to preserve this information. We show that the binding information can be accessed cross-modally by applying a simple linear transformation to the text embeddings. This improves CLIP’s attribute–object binding performance and confirms that the information was already encoded uni-modally. In practice, this means CLIP-based systems can be enhanced with a lightweight linear layer trained on existing embeddings, avoiding costly encoder retraining.
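
The abstract only states that a lightweight linear layer is trained on frozen CLIP text embeddings; the exact training data, objective, and hyperparameters are not given. The sketch below illustrates one plausible setup under those assumptions: a linear map initialized at the identity, applied to frozen text embeddings, and trained with a standard symmetric contrastive loss on image–caption pairs. All names, the loss choice, and the hyperparameters are illustrative, not the paper's recipe.

```python
# Sketch (assumptions flagged): train a linear correction on top of frozen CLIP
# text embeddings so that attribute-object binding information survives the
# cross-modal alignment. Dataset, loss, and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # both encoders stay frozen; only the linear map is trained

embed_dim = model.text_projection.shape[1]  # 512 for ViT-B/32
text_map = nn.Linear(embed_dim, embed_dim, bias=False).to(device)
nn.init.eye_(text_map.weight)  # start from identity: learn only a small correction
optimizer = torch.optim.AdamW(text_map.parameters(), lr=1e-4)

def training_step(images, captions):
    """One contrastive step; `images` is a preprocessed batch, `captions` a list of strings."""
    with torch.no_grad():
        img_emb = model.encode_image(images.to(device)).float()
        txt_emb = model.encode_text(clip.tokenize(captions).to(device)).float()
    txt_emb = text_map(txt_emb)                 # learned linear transformation of text embeddings
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = 100.0 * img_emb @ txt_emb.t()      # temperature comparable to CLIP's logit scale
    targets = torch.arange(len(captions), device=device)
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only a single `embed_dim × embed_dim` matrix is optimized, training on precomputed embeddings is cheap and the frozen encoders never need to be revisited, which matches the abstract's claim of avoiding costly encoder retraining.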
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 4454