Boosting Visual-Language Models by Exploiting Hard Pairs

TMLR Paper1710 Authors

20 Oct 2023 (modified: 17 Sept 2024), Rejected by TMLR, CC BY 4.0

Abstract: Large vision and language models, such as Contrastive Language-Image Pre-training (CLIP), have emerged as the industry standard for aligning images with their corresponding textual descriptions. However, to enhance zero-shot recognition, current methods often demand additional data collection and retraining with newly introduced loss functions, which hinders their application to an already well-trained CLIP model. In this work, we present Helip, a low-cost strategy tailored to enhance the performance of pre-trained CLIP models. This is achieved by further training them with challenging text-image pairs selected from their training dataset. Our proposed Hard Pair Mining (HPM) method treats a text-image pair as a single point in the joint Vision-Language space and identifies those in close proximity to a given pair as its hard pairs. By incorporating these challenging data, we refine pretrained CLIP models using both the traditional contrastive alignment loss and the newly introduced Hard Negative Margin Loss (HNML), so that the model makes fuller use of the information in the challenging data. Notably, Helip is designed to integrate with existing models, providing an enhancement without the need to train a model from scratch or collect additional data. On a comprehensive zero-shot and retrieval benchmark, Helip consistently boosts existing models to achieve leading performance. In particular, for ImageNet zero-shot accuracy, Helip boosts CC3M and CC12M pretrained SLIP by 3.05 and 4.47, respectively. In addition, systematic evaluations of zero-shot and linear probing experiments across fine-grained classification datasets demonstrate a consistent performance improvement and validate the efficacy of Helip. Specifically, Helip boosts the zero-shot performance of pretrained CLIP and SLIP by an average of 8.4% and 18.6%, respectively, and improves their linear probe performance by an average of 9.5% and 3.0%.
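The hard-pair idea in the abstract (treat each text-image pair as one point in a joint space, then take its nearest neighbours as hard pairs) can be illustrated with a minimal sketch. This is not the authors' HPM procedure: the abstract does not specify how the joint representation is built or how proximity is measured, so the sketch assumes the joint point is the concatenation of the L2-normalized image and text embeddings and that proximity is cosine similarity. The names `joint_point` and `mine_hard_pairs` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def joint_point(image_emb, text_emb):
    """Represent one text-image pair as a single point.

    Assumption: concatenate the normalized image and text embeddings.
    The abstract does not state the actual construction.
    """
    return l2_normalize(l2_normalize(image_emb) + l2_normalize(text_emb))


def mine_hard_pairs(image_embs, text_embs, target, k):
    """Return indices of the k pairs closest to the target pair.

    Assumption: closeness is cosine similarity between joint points.
    """
    points = [joint_point(i, t) for i, t in zip(image_embs, text_embs)]
    anchor = points[target]
    scored = []
    for idx, point in enumerate(points):
        if idx == target:
            continue  # a pair is not its own hard pair
        similarity = sum(a * b for a, b in zip(anchor, point))
        scored.append((similarity, idx))
    # Highest similarity first; ties broken by lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


# Toy embeddings, invented purely for illustration.
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text_embs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.0]]
hard = mine_hard_pairs(image_embs, text_embs, target=0, k=2)
```

In this toy data, pair 1 lies almost on top of pair 0 in the joint space, so it is selected first; such near-duplicates are the kind of challenging examples the abstract proposes to reuse during further training. A brute-force scan like this is quadratic in dataset size, so a real pipeline over a pretraining corpus would need an approximate search.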

Submission Length: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Changyou_Chen1

Submission Number: 1710
