Towards understanding the modality gap in CLIP

Published: 06 Mar 2023, Last Modified: 01 May 2023, MRL 2023, Readers: Everyone
Keywords: CLIP, modality gap, multimodality, contrastive learning, NT-Xent, SLIP
TL;DR: This work examines the phenomenon of the modality gap observed in CLIP-based multimodal learning methods
Abstract: This work examines the phenomenon of the modality gap observed in CLIP-based multimodal learning methods. The modality gap in this context refers to the separation of image and text embeddings in the joint latent space. Some previous research has attributed the gap to the cone effect of neural network initialization and suggested that closing it may not be necessary. However, this study argues that the modality gap is associated with local minima in the CLIP loss function. Through a series of proof-of-concept experiments, we illustrate these local minima and the difficulty of avoiding them in practice. Overall, this work aims to provide better insight into the root cause of the modality gap.
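
For concreteness, the two quantities discussed in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code: the function names, the temperature value of 0.07, and the use of the distance between modality centroids as a proxy for the gap are all assumptions made here. The loss is the symmetric NT-Xent (InfoNCE) form commonly used for CLIP-style training.

```python
import numpy as np


def l2_normalize(x):
    # Project each row onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def clip_loss(img, txt, temperature=0.07):
    # Symmetric NT-Xent loss; row i of img is paired with row i of txt.
    # The temperature is an assumed value, not taken from the paper.
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def centroid_gap(img, txt):
    # Assumed proxy for the modality gap: Euclidean distance between
    # the mean normalized image and mean normalized text embedding.
    img, txt = l2_normalize(img), l2_normalize(txt)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(8, 16))   # synthetic "image" embeddings
    txt = img + 5.0                  # synthetic "text" embeddings, shifted
    print("loss:", clip_loss(img, txt))
    print("gap:", centroid_gap(img, txt))
```

Under this proxy, a gap of zero means the two modalities share a centroid on the unit sphere, while a large value means they occupy separate regions of the joint space.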