Boosting Visual-Language Models by Exploiting Hard Pairs

TMLR Paper 2323 Authors

03 Mar 2024 (modified: 24 Mar 2024) · Under review for TMLR
Abstract: Contrastive Language-Image Pre-training (CLIP) has become the standard for learning cross-modal representations between images and text. Efforts to improve its capabilities typically require collecting additional data and retraining with new loss functions. While effective, these added requirements limit practical use due to the increased resource and time investment. In this work, we present HELIP, a cost-effective strategy for enhancing the performance of existing CLIP models without training from scratch or collecting additional data. Our method integrates seamlessly with existing training pipelines, providing an immediate boost by continuing training on challenging text-image pairs selected from the original training datasets. HELIP treats each text-image pair as a single point in the joint vision-language space and identifies pairs in close proximity as hard pairs. By incorporating these challenging data, pre-trained CLIP models are refined with both the conventional contrastive loss and a newly introduced hard negative margin loss, ensuring the challenging data are fully exploited. On comprehensive benchmarks, HELIP consistently boosts existing models to leading performance. In particular, it improves the zero-shot classification accuracy on ImageNet of SLIP models pre-trained on CC3M, CC12M, and YFCC15M by 3.05%, 4.47%, and 10.1%, respectively, within two epochs of training. In addition, across fine-grained classification datasets, HELIP improves the zero-shot performance of pre-trained CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%. The code is publicly available at https://anonymous.4open.science/r/HELIP-7F8E/.
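To make the two ingredients described above concrete, the sketch below illustrates in PyTorch how nearest-neighbour hard-pair mining in a joint image-text space and a hard negative margin term on top of the standard contrastive loss could be wired together. This is a minimal illustration, not the authors' implementation: the concatenated joint representation, the in-batch mining (the paper selects hard pairs from the original training dataset), and the hinge form of the margin loss are assumptions made for exposition.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) treat each text-image pair as one point in a joint space and take its
#     nearest neighbours as "hard pairs", and (2) add a margin term that pushes
#     a pair away from its hard negatives on top of the usual CLIP loss.
import torch
import torch.nn.functional as F


def mine_hard_pairs(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Return indices of the k nearest pairs for every pair (excluding itself)."""
    joint = F.normalize(torch.cat([img_emb, txt_emb], dim=-1), dim=-1)  # (N, 2D) joint point per pair
    sim = joint @ joint.t()                                             # pair-to-pair similarity
    sim.fill_diagonal_(float("-inf"))                                   # drop self-matches
    return sim.topk(k, dim=-1).indices                                  # (N, k) hard-pair indices


def clip_with_hard_negative_margin(img_emb, txt_emb, hard_idx, margin=0.2, temperature=0.07):
    """Standard contrastive (InfoNCE) loss plus a hinge margin against mined hard negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    # Hinge: a pair's positive similarity should beat its hard negatives by `margin`.
    pos = (img * txt).sum(-1, keepdim=True)           # (N, 1) matched-pair similarity
    hard_txt = txt[hard_idx]                          # (N, k, D) captions of the hard pairs
    neg = torch.einsum("nd,nkd->nk", img, hard_txt)   # (N, k) image vs. hard captions
    margin_loss = F.relu(margin + neg - pos).mean()
    return contrastive + margin_loss


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for a CLIP image/text encoder.
    img_emb, txt_emb = torch.randn(32, 512), torch.randn(32, 512)
    hard_idx = mine_hard_pairs(img_emb, txt_emb, k=4)
    print(clip_with_hard_negative_margin(img_emb, txt_emb, hard_idx))
```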
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=WWwEvGkJL9
Changes Since Last Submission: In response to the previous feedback, we have refined the writing and added new experiments, analyses, and discussions in this version.
1. We added an experiment on a larger open-source dataset (roughly 10x the size of CC3M), which we refer to as Open29M (combining CC3M, CC12M, and YFCC15M). This extended analysis is detailed in Section 4. The results indicate that HELIP can instantly enhance CLIP's performance on larger datasets.
2. We added a discussion of the effectiveness of HELIP with respect to scaled training data in Appendix A.3. The results show that HELIP consistently boosts CLIP's performance across various training data sizes, with the most significant improvement observed on Open29M, the largest dataset used in our experiments.
3. We further clarified the implementation and significance of the baselines, and provide additional discussion of these baselines in the appendix.
4. We refined and reorganized several sentences to enhance clarity and more effectively convey the motivation behind our work.
Assigned Action Editor: ~Zhiding_Yu1
Submission Number: 2323