Keywords: CLIP, modality gap, multimodality, contrastive learning, NT-Xent, SLIP
TL;DR: This work examines the modality gap observed in CLIP-based multimodal learning methods.
Abstract: This work examines the modality gap observed in CLIP-based multimodal learning methods. The modality gap here refers to the separation of image and text embeddings in the joint latent space. Some previous research has attributed the gap to the cone effect of neural network initialization and suggested that closing it may not be necessary. However, this study argues that the modality gap is associated with local minima in the CLIP loss function. Through a series of proof-of-concept experiments, we illustrate these local minima and the difficulty of avoiding them in practice. Overall, this work aims to provide better insight into the root cause of the modality gap.
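To make the two quantities named in the abstract concrete, here is a minimal pure-Python sketch. It is not the paper's code. It assumes one common way to measure the gap (the Euclidean distance between the centroids of the L2-normalized image and text embeddings) and the usual symmetric cross-entropy form of the CLIP loss. The function names, the toy embeddings, and the temperature value of 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_embs, text_embs):
    """Assumed gap measure: distance between the two modality centroids."""
    c_img = centroid([l2_normalize(v) for v in image_embs])
    c_txt = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))


def _mean_diagonal_cross_entropy(logits):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    Pair i is (image_embs[i], text_embs[i]); the temperature is an
    illustrative assumption, not a value taken from the paper.
    """
    img = [l2_normalize(v) for v in image_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_mean_diagonal_cross_entropy(logits)
                  + _mean_diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    # Toy 2-D embeddings (hypothetical data): each text vector is a rotated
    # copy of its image vector, so the modalities sit in separate regions.
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, -0.5], [0.5, 1.0]]
    print("gap: ", modality_gap(images, texts))
    print("loss:", clip_loss(images, texts))
```

Reporting both numbers on the same batch is one simple way to check whether a low contrastive loss coincides with a nonzero centroid distance, which is the kind of situation the abstract associates with local minima.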