Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: dataset filtering, dataset curation, vision-language pretraining, CLIP, representation learning
TL;DR: Introducing a learned algorithm for curating datasets for CLIP pretraining that achieves state-of-the-art ImageNet accuracy on the DataComp benchmark.
Abstract: We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that *learns* the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks' training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation via a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that---like us---use only public resources. Our approach also yields 37.7% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.
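The abstract describes SCS only at a high level: sample examples from the learned per-example distribution while penalizing repeats. Below is a minimal illustrative sketch of that idea, assuming a multiplicative repetition penalty and the hypothetical function name `soft_cap_sample`; the paper's exact formulation may differ.

```python
import numpy as np

def soft_cap_sample(probs, target_size, penalty=0.5, rng=None):
    """Sketch of Soft Cap Sampling: draw `target_size` indices from a
    per-example distribution, decaying an example's weight by `penalty`
    each time it is drawn (assumed penalty mechanism, not the paper's exact one)."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(probs, dtype=np.float64).copy()
    chosen = []
    for _ in range(target_size):
        p = weights / weights.sum()          # renormalize after penalties
        idx = rng.choice(len(p), p=p)        # sample one example index
        chosen.append(idx)
        weights[idx] *= penalty              # repetition penalty: soft cap on re-sampling
    return chosen

# Usage: turn scoring-model outputs into a distribution, then sample a subset.
scores = np.array([3.0, 1.0, 0.5, 0.2])
probs = scores / scores.sum()
subset = soft_cap_sample(probs, target_size=3)
```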
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 27144