Towards understanding the modality gap in CLIP

Peiyang Shi; Michael C. Welle; Mårten Björkman; Danica Kragic

Published: 06 Mar 2023, Last Modified: 01 May 2023 | MRL 2023 | Readers: Everyone

Keywords: CLIP, modality gap, multimodality, contrastive learning, NT-Xent, SLIP

TL;DR: This work examines the phenomenon of the modality gap observed in CLIP-based multimodal learning methods

Abstract: This work examines the phenomenon of the modality gap observed in CLIP-based multimodal learning methods. The modality gap in this context refers to the separation of image and text embeddings in the joint latent space. Some previous research has attributed the gap to the cone effect of neural network initialization and suggested that closing it may not be necessary. However, this study argues that the modality gap is associated with local minima in the CLIP loss function. Through a series of proof-of-concept experiments, we illustrate these local minima and the difficulty of avoiding them in practice. Overall, we hope this work provides better insight into the root cause of the modality gap.
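
To make the two quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes the gap is measured as the Euclidean distance between the centroids of the L2-normalized image and text embeddings (one possible measure; the abstract does not specify one), and it assumes the CLIP loss is the symmetric NT-Xent loss over a batch of matched pairs. The function names `centroid_gap` and `clip_loss` and the default temperature of 0.07 are illustrative choices.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def centroid_gap(image_emb, text_emb):
    """Assumed gap measure: distance between the two modality centroids."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def _log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric NT-Xent loss; row i of each input is assumed to be a matched pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix
    diag = np.arange(logits.shape[0])           # positives sit on the diagonal
    loss_i2t = -_log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -_log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)


if __name__ == "__main__":
    # Toy data (hypothetical): shared content plus a modality-specific offset.
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(8, 16))
    offset = np.zeros(16)
    offset[0] = 3.0
    image_emb = shared + offset
    text_emb = shared - offset
    print("gap:", centroid_gap(image_emb, text_emb))
    print("loss:", clip_loss(image_emb, text_emb))
```

Tracking both numbers over training is one way to observe whether a low loss coexists with a non-zero gap, which is the kind of situation the abstract associates with local minima.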
