The Double-Ellipsoid Geometry of CLIP

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We reveal that the embedding space of CLIP consists of two ellipsoids, one per modality.
Abstract: Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications across a wide variety of domains. We investigate the geometry of this embedding, which is still not well understood, and show that text and image embeddings reside on linearly separable ellipsoid shells that are not centered at the origin. We explain the benefits of this structure, which allows instances to be embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. We introduce a new notion of conformity, which measures the average cosine similarity of an instance to every other instance within a representative dataset. We show that this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP's modality gap optimizes the matching of the conformity distributions of image and text.
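As a rough illustration of the conformity measure described in the abstract (not the authors' reference implementation), the minimal NumPy sketch below computes both the exact definition (average cosine similarity to every other instance) and the cheap estimate (cosine similarity to the modality mean). The function names, and the choice to build the mean from L2-normalized embeddings and renormalize it, are assumptions for the sake of the example.

```python
import numpy as np

def conformity(embeddings: np.ndarray) -> np.ndarray:
    """Average cosine similarity of each instance to every other instance of the same modality."""
    # L2-normalize so dot products become cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T                         # pairwise cosine similarities
    n = len(z)
    # Exclude the self-similarity (which is exactly 1) from the average.
    return (sims.sum(axis=1) - 1.0) / (n - 1)

def conformity_via_mean(embeddings: np.ndarray) -> np.ndarray:
    """Cheap estimate: cosine similarity to the modality mean vector (assumed here to be
    the mean of the normalized embeddings, renormalized to unit length)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    mu = z.mean(axis=0)
    mu /= np.linalg.norm(mu)
    return z @ mu

# Usage sketch with stand-in features in place of real CLIP image or text embeddings:
feats = np.random.randn(1000, 512)
print(np.corrcoef(conformity(feats), conformity_via_mean(feats))[0, 1])
```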
Lay Summary: Modern AI models like CLIP can understand both images and text by mapping them into a shared mathematical space. This shared space allows the model to match an image to a caption or vice versa. Our paper uncovers a surprising geometric structure inside this space. Rather than forming a neat, uniform cloud of points, we found that CLIP’s image and text representations each lie on separate, off-center ellipsoid shapes. In simple terms, the way images and text are stored is uneven and misaligned. This misalignment can lead to inefficiencies when the model tries to compare different types of data. We introduce a simple yet powerful measure called conformity, which reveals how “common” or typical an image is in CLIP’s eyes. Images of everyday concepts—like a dog playing in a yard—tend to be embedded near the center of the ellipsoid, while rare or unusual ones—like a specific person or abstract artwork—are placed further away. This insight helps us better understand how CLIP organizes information and could be valuable for improving tasks that involve rare or hard-to-edit cases, such as personalized image generation or targeted editing.
Link To Code: https://github.com/yossilevii100/double-ellipsoid-clip
Primary Area: Deep Learning->Foundation Models
Keywords: CLIP embedding, modality gap, narrow cone effect
Submission Number: 7662