Keywords: vision-language, clip, data curation, batch sampling, pretraining
TL;DR: Concept-aware data curation and batch sampling improve the downstream performance of contrastive vision-language models.
Abstract: What data should a CLIP model see? Many data curation efforts aiming to answer
this question center on the quality of a dataset. However, recent work has shown that,
despite yielding impressive performance benefits, none of these curation methods
is concept-centric, so they inherit the biased properties of web-scale
data distributions. In this work, we go beyond such concept-agnostic methods and
advocate a more flexible online concept-based curation approach. To enable this,
our first contribution is DATACONCEPT, a collection of 128M web-crawled image-
text pairs annotated with fine-grained details about their concept composition.
Building on DATACONCEPT, we fill another critical gap in the literature: the lack of
a competitive, open-source alternative to highly performant batch sampling methods
for Language-Image Pretraining (LIP). Specifically, we introduce Concept-Aware Batch
Sampling (CABS), a simple yet effective batch-sampling algorithm that constructs
batches covering the broadest possible set of available concepts. Through rigorous evaluation on
a broad suite of 28 benchmarks, we demonstrate that CABS significantly benefits
LIP and yields highly performant models on long-
tailed evaluations (up to +2.4 p.p. on Let-it-Wag!), while enabling practitioners to
define custom concept distributions that optimize for specific downstream tasks.
Importantly, with only one hyperparameter tuned on a single (backbone, eval)
combination, CABS is fully compatible with both CLIP and SigLIP
models. Both DATACONCEPT and the source code for CABS will be released.
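For intuition, below is a minimal sketch of concept-aware greedy batch selection, assuming each image-text pair is annotated with a set of concept IDs (as in DATACONCEPT); the function and variable names are hypothetical and this is not the paper's released implementation of CABS.

import random

def sample_concept_aware_batch(pool, batch_size, rng=random):
    """Greedily assemble a batch that maximizes concept coverage.

    pool: list of (example_id, concept_set) tuples, where concept_set is the
    set of concept IDs annotating that image-text pair.
    Illustrative sketch only; not the paper's released implementation.
    """
    remaining = list(pool)
    rng.shuffle(remaining)              # random tie-breaking order
    covered = set()                     # concepts already present in the batch
    batch = []
    while remaining and len(batch) < batch_size:
        # Choose the candidate contributing the most not-yet-covered concepts.
        best_idx = max(range(len(remaining)),
                       key=lambda i: len(remaining[i][1] - covered))
        example_id, concepts = remaining.pop(best_idx)
        batch.append(example_id)
        covered |= concepts
    return batch

# Toy usage: four annotated pairs, batch of two.
pool = [("img_0", {"dog", "grass"}), ("img_1", {"dog"}),
        ("img_2", {"car", "street"}), ("img_3", {"cat", "sofa"})]
print(sample_concept_aware_batch(pool, batch_size=2))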
Primary Area: datasets and benchmarks
Submission Number: 20850