Improved CLIP Training Objective on Fine-Grained Tasks: Tackling False Negatives and Data Noise

TMLR Paper 5270 Authors

02 Jul 2025 (modified: 28 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Despite its success in various image-text tasks like zero-shot classification on ImageNet, CLIP has been shown to overlook important details in images and captions. This limitation hinders its performance in fine-grained image-text matching tasks. In this paper, we approach this issue through the lens of false negatives (incorrect negative pairs) and data noise (i.e., mislabeled data), which can prevent the model from learning critical details, especially in downstream tasks with a limited number of classes. To address this, we introduce a new loss term incorporating additional supervision to emphasize true negatives. Additionally, we modify the InfoNCE loss to mitigate the impact of data noise. We show that our new method is provably effective under fewer data assumptions than previous approaches, making it particularly suited to noisy multi-modal data. Using the counting task as an example and CLEVR-Count as the benchmark, we demonstrate the performance improvements achieved by our algorithm without requiring extra labeled data.
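The abstract refers to CLIP's InfoNCE objective and to masking out false negatives with additional supervision. As a point of reference only, the sketch below shows the standard symmetric InfoNCE loss used by CLIP, with an optional boolean mask that excludes suspected false negatives from the softmax denominator; the function name `clip_info_nce` and the `negative_mask` argument are illustrative assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F


def clip_info_nce(image_emb, text_emb, temperature=0.07, negative_mask=None):
    """Symmetric InfoNCE over a batch of image/text embeddings.

    negative_mask (optional): boolean [B, B] tensor; off-diagonal pairs set to
    False (e.g. suspected false negatives) are dropped from the denominator.
    """
    # Cosine-similarity logits between every image and every text in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # [B, B]

    if negative_mask is not None:
        # Always keep the diagonal (matched pairs); mask out excluded negatives.
        keep = negative_mask | torch.eye(
            logits.size(0), dtype=torch.bool, device=logits.device
        )
        logits = logits.masked_fill(~keep, float("-inf"))

    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

In the unmasked case this reduces to the usual CLIP objective; the mask merely illustrates one place where extra supervision about true versus false negatives could enter, and it does not reproduce the paper's proposed loss term or its noise-robust modification.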
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Hongsheng_Li3
Submission Number: 5270