Concept-Aware Batch Sampling Improves Language-Image Pretraining

ICLR 2026 Conference Submission 20850 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: vision-language, clip, data curation, batch sampling, pretraining
TL;DR: Concept-aware data curation and batch sampling improve the downstream performance of contrastive vision-language models.
Abstract: What data should a CLIP model see? Many data curation efforts aiming to answer this question center on the quality of a dataset. However, recent work has shown that, while yielding impressive performance benefits, none of these curation methods are concept-centric, leading them to inherit the biased properties of web-scale data distributions. In this work, we go beyond such concept-agnostic methods and advocate a more flexible, online, concept-based curation approach. To enable this, our first contribution is DATACONCEPT, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DATACONCEPT, we fill another critical gap in the literature: the lack of a competitive, open-source alternative to highly performant batch-sampling methods for Language-Image Pretraining (LIP). Specifically, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch-sampling algorithm that distills batches with the broadest set of available concepts. Through rigorous evaluation on a broad suite of 28 benchmarks, we demonstrate that CABS significantly benefits LIP and yields highly performant models on long-tailed evaluations (up to +2.4 p.p. on Let-it-Wag!), while enabling practitioners to define custom concept distributions that optimize for specific downstream tasks. Importantly, with only one hyperparameter tuned for a single (backbone, eval) combination, CABS shows full compatibility with both CLIP and SigLIP models. Both DATACONCEPT and the source code for CABS will be released.
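
The abstract does not spell out the sampling procedure, but one natural reading of "distills batches with the broadest set of available concepts" is a greedy max-coverage selection over per-sample concept annotations. The sketch below is a hypothetical illustration under that assumption, not the authors' released code; the names `pool`, `concept_coverage_batch`, and the toy concept sets are all invented for exposition.

```python
import random

def concept_coverage_batch(pool, batch_size, seed=0):
    """Greedily pick `batch_size` samples maximizing concept coverage.

    pool: dict mapping sample id -> set of concept ids (assumed to come
    from concept annotations such as those in DATACONCEPT).
    Returns a list of sample ids.

    NOTE: a toy sketch of greedy max-coverage batch sampling, not the
    paper's actual CABS implementation.
    """
    rng = random.Random(seed)
    remaining = set(pool)
    covered, batch = set(), []
    while remaining and len(batch) < batch_size:
        # Pick the sample contributing the most concepts not yet in the
        # batch; random tie-breaking avoids systematic ordering bias.
        best = max(remaining,
                   key=lambda s: (len(pool[s] - covered), rng.random()))
        batch.append(best)
        covered |= pool[best]
        remaining.remove(best)
    return batch

# Toy usage: six annotated image-text pairs, batch of three.
pool = {
    "img_0": {"dog", "grass"},
    "img_1": {"dog"},
    "img_2": {"cat", "sofa"},
    "img_3": {"airplane", "sky", "cloud"},
    "img_4": {"sky"},
    "img_5": {"bicycle", "street"},
}
print(concept_coverage_batch(pool, batch_size=3))
# e.g. ['img_3', 'img_2', 'img_0'] -- seven distinct concepts covered
```

A custom target concept distribution, as the abstract mentions, could be expressed in this framing by weighting each sample's uncovered concepts instead of simply counting them.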
Primary Area: datasets and benchmarks
Submission Number: 20850