Data-Efficient Multi-Modal Contrastive Learning: Prioritizing Data Quality over Quantity

ICLR 2024 Workshop DMLR Submission 78 Authors

Published: 04 Mar 2024, Last Modified: 02 May 2024, DMLR @ ICLR 2024, CC BY 4.0
Keywords: data selection, clip, multimodal learning, pre-training
TL;DR: Theoretically motivated and empirically successful data selection for pre-training of multi-modal (vision-language) contrastive learning.
Abstract: Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data, and improving the quality of that data has been shown to be far more effective at improving CLIP's performance than increasing its volume. Nevertheless, finding a subset of image-caption pairs that, when trained on, provably generalizes on par with the full data has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that best preserve the cross-covariance of the images and captions of the full data best preserve CLIP's generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets of size 5%-10% found by our method, CLIPCov, achieve over 150% and 40% of the accuracy of the next best baseline on ImageNet and its shifted versions, respectively. Moreover, our subsets achieve an average relative performance improvement of nearly 50% over the next best baseline across 14 downstream datasets.
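To make the cross-covariance criterion concrete, below is a minimal sketch of the underlying idea: score candidate subsets by how closely their image-caption cross-covariance matches that of the full dataset, and pick pairs greedily. This is an illustrative assumption-laden toy, not the paper's CLIPCov algorithm; the embeddings, the greedy rule, and the function names (`cross_covariance`, `select_subset`) are hypothetical.

```python
# Illustrative sketch only: greedily select image-caption pairs whose
# cross-covariance stays close (in Frobenius norm) to that of the full data.
# Random embeddings stand in for CLIP image/text features.
import numpy as np


def cross_covariance(img_emb, txt_emb):
    # Uncentered cross-covariance between paired image and caption embeddings.
    # img_emb, txt_emb: arrays of shape (n, d).
    return img_emb.T @ txt_emb / len(img_emb)


def select_subset(img_emb, txt_emb, budget):
    # Greedy heuristic: at each step add the pair that makes the subset's
    # cross-covariance best match the full-data cross-covariance.
    target = cross_covariance(img_emb, txt_emb)
    running = np.zeros_like(target)
    selected, remaining = [], set(range(len(img_emb)))
    for _ in range(budget):
        best_i, best_err = None, np.inf
        for i in remaining:
            contrib = np.outer(img_emb[i], txt_emb[i])
            err = np.linalg.norm(target - (running + contrib) / (len(selected) + 1))
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
        running += np.outer(img_emb[best_i], txt_emb[best_i])
        remaining.discard(best_i)
    return selected


# Toy usage: select 10% of 200 synthetic pairs.
rng = np.random.default_rng(0)
imgs, txts = rng.normal(size=(200, 16)), rng.normal(size=(200, 16))
subset_indices = select_subset(imgs, txts, budget=20)
```

The exhaustive greedy loop here is only for clarity; at the scale of ConceptualCaptions3M/12M one would need a far more efficient selection procedure, which is precisely what a principled method has to provide.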
Primary Subject Area: Data collection and benchmarking techniques
Paper Type: Extended abstracts: up to 2 pages
DMLR For Good Track: Participate in DMLR for Good Track
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 78