It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap

Published: 26 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | License: CC BY 4.0
Keywords: Multi-Modal Representation Learning, Contrastive Representation Learning, Modality Gap, CLIP
TL;DR: In this paper, we analyze the “contrastive gap” in multi-modal models like CLIP and show that optimizing for uniformity and alignment reduces the gap, improving downstream performance.
Abstract: Learning jointly from images and texts using contrastive pre-training has emerged as an effective method to train large-scale models with a strong grasp of semantic image concepts. For instance, CLIP, pre-trained on a large corpus of web data, excels in tasks such as zero-shot image classification, object detection, and geolocalization. These contrastive models embed input images and texts into a shared representational space. Recently, it was claimed that models like CLIP exhibit a *modality gap*, where image and text embeddings occupy disjoint regions of the representational space. Previous studies attribute this gap to factors such as data artifacts (mismatched pairs), model architecture artifacts (the cone effect), and the nature of the loss landscape (getting stuck in local minima). We demonstrate that, even after accounting for these factors, and even when both inputs come from the *same modality*, the contrastive loss itself *creates* a gap during training. We therefore propose renaming this phenomenon the *contrastive gap*. We show that the contrastive gap is exacerbated by training with small batch sizes in high-dimensional spaces, causing the embeddings of each modality to occupy small, disjoint portions of the latent space. Our experiments show that minimizing the contrastive gap by adding uniformity and alignment terms to the loss yields a more uniformly distributed and better-aligned representational space and improves performance on downstream tasks such as zero-shot image classification and multi-modal arithmetic.
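
The uniformity and alignment terms mentioned in the abstract follow the standard formulation of Wang & Isola (2020). The sketch below is not the authors' implementation; it is a minimal illustration of how a CLIP-style symmetric InfoNCE loss can be augmented with an alignment term (matched image–text pairs pulled together) and per-modality uniformity terms (embeddings spread over the unit hypersphere). The weighting coefficients `lambda_align` and `lambda_unif` are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: CLIP-style contrastive loss plus alignment/uniformity terms
# (Wang & Isola, 2020). Not the authors' code; hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over L2-normalized image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def alignment_loss(img_emb, txt_emb, alpha=2):
    """Alignment: matched image-text pairs should lie close together."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return (img_emb - txt_emb).norm(dim=1).pow(alpha).mean()


def uniformity_loss(emb, t=2):
    """Uniformity: embeddings should spread out over the unit hypersphere."""
    emb = F.normalize(emb, dim=-1)
    sq_dists = torch.pdist(emb, p=2).pow(2)  # pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()


def total_loss(img_emb, txt_emb, lambda_align=1.0, lambda_unif=1.0):
    """Contrastive loss plus alignment and per-modality uniformity terms."""
    unif = 0.5 * (uniformity_loss(img_emb) + uniformity_loss(txt_emb))
    return (clip_contrastive_loss(img_emb, txt_emb)
            + lambda_align * alignment_loss(img_emb, txt_emb)
            + lambda_unif * unif)
```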
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5763