Keywords: multimodal representation learning, visual representation learning, CLIP fine-tuning
TL;DR: We conduct an empirical study to systematically explore an approach for better CLIP fine-tuning: constructing minibatches that contain clusters of similar image-text pairs to increase the difficulty of the negative examples.
Abstract: With the success of CLIP training for learning transferable visual representations, fine-tuning CLIP models on smaller datasets for better downstream performance is an important area of research. A method for improving CLIP models is to increase the difficulty of negative examples. While the majority of research has focused on manually crafting hard negative captions, this strategy requires additional engineering labor, fails to generalize to different domains, and causes additional overfitting. Here, we conduct an empirical study to systematically explore an alternative approach: construct minibatches that include similarity clusters to increase the difficulty of negative examples. We propose a generalized framework, called SimCLIP, for similarity-based CLIP fine-tuning. By enforcing that each minibatch contains clusters of similar examples, SimCLIP fine-tuning can improve model performance compared to standard CLIP fine-tuning. We extensively study which SimCLIP configurations and factors contribute most to downstream performance. We also analyze SimCLIP's performance on rare special sets, compositionality of attributes, and generalization across dataset sizes. Our observations provide a better understanding of similarity-based minibatch construction methods as well as new insights into CLIP fine-tuning.
Submission Number: 48
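As illustrative context for the abstract's core idea, below is a minimal sketch of similarity-clustered minibatch construction. It assumes precomputed per-pair embeddings and k-means clustering; these are hypothetical choices for illustration, not necessarily the SimCLIP configuration studied in the paper.

```python
# Minimal sketch (not the paper's implementation): cluster image-text pair
# embeddings, then draw each minibatch from a single cluster so that the
# in-batch negatives are semantically similar, i.e. harder.
import numpy as np
from sklearn.cluster import KMeans


def build_similarity_batches(embeddings, batch_size, n_clusters, seed=0):
    """Return a list of index arrays, each forming one similarity-clustered batch.

    embeddings: (N, D) array of pooled features per image-text pair
                (e.g. from a pretrained encoder); this is an assumed input.
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)

    batches = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Examples in the same cluster serve as hard negatives for one another
        # under the standard CLIP contrastive loss.
        for start in range(0, len(idx), batch_size):
            batches.append(idx[start:start + batch_size])

    # Shuffle batch order so training does not see clusters in a fixed sequence.
    order = rng.permutation(len(batches))
    return [batches[i] for i in order]


if __name__ == "__main__":
    pair_embeddings = np.random.randn(1_000, 64).astype(np.float32)  # stand-in features
    batches = build_similarity_batches(pair_embeddings, batch_size=128, n_clusters=8)
    print(f"{len(batches)} batches; first sizes: {[len(b) for b in batches[:3]]}")
```

The resulting index lists can drive a standard CLIP fine-tuning loop in place of uniform random batching; the cluster count and embedding source are the main knobs such a scheme exposes.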