Multi-head CLIP: Improving CLIP with Diverse Representations and Flat Minima

Published: 26 Oct 2023, Last Modified: 13 Dec 2023 · NeurIPS 2023 Workshop Poster
Keywords: contrastive language image pre-training, diverse representation, sharpness-aware minimization
TL;DR: Multi-head CLIP enhances the CLIP framework by enabling the model to learn multiple representations for each data point, resulting in improved zero-shot performance and better multimodal understanding.
Abstract: Contrastive Language-Image Pre-training (CLIP) has shown remarkable success in the field of multimodal learning by enabling joint understanding of text and images. In this paper, we introduce a novel method called Multi-head CLIP, inspired by Stein Variational Gradient Descent (SVGD) and Sharpness-Aware Minimization (SAM). Our approach aims to enhance CLIP's learning capability by encouraging the model to acquire diverse features while also promoting convergence towards a flat loss region, resulting in improved generalization performance. We conduct extensive experiments on two benchmark datasets, YFCC15M and CC3M, to evaluate the effectiveness of our proposed method. The experimental results consistently demonstrate that Multi-head CLIP outperforms both the original CLIP architecture and CLIP with the SAM optimizer.
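To give a concrete sense of the SAM component the abstract refers to, the following is a minimal numpy sketch of one Sharpness-Aware Minimization step on a toy quadratic loss. This is an illustration of the generic SAM update (perturb weights toward the local worst case, then descend using the gradient at the perturbed point), not the authors' implementation; the loss function and the `rho`/`lr` values are placeholder assumptions.

```python
import numpy as np

def loss_grad(w):
    # Toy loss L(w) = sum(w**2); its analytic gradient is 2*w.
    # Placeholder standing in for the CLIP contrastive loss.
    return 2.0 * w

def sam_step(w, rho=0.05, lr=0.1):
    """One generic SAM update (illustrative, not the paper's code)."""
    g = loss_grad(w)
    # Ascend to the (first-order) worst-case point within an L2 ball
    # of radius rho around the current weights.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descend using the gradient evaluated at the perturbed weights,
    # which biases the trajectory toward flat regions of the loss.
    g_adv = loss_grad(w + eps)
    return w - lr * g_adv

w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w)
```

In the multi-head setting described by the abstract, each head would receive its own such update, with an additional diversity-promoting term (in the spirit of SVGD) repelling the heads' representations from one another.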
Submission Number: 63