TL;DR: Our results challenge the view that Classifier-based Quality Filtering, a popular data processing method, captures a meaningful notion of data quality.
Abstract: Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. CQF then uses the classifier's score as each pretraining document's quality score and retains only the top-scoring documents. We provide an in-depth analysis of CQF.
We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality set. Importantly, we find that training on CQF-selected data can outperform training directly on the high-quality set, even when the latter is sufficiently large. This finding alone is particularly striking, given the substantial effort and cost recently devoted to augmenting high-quality data. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well as the low-quality one. Finally, we introduce an optimization-driven notion of data quality and demonstrate that it can be reliably estimated using small-scale proxy experiments. Altogether, our results both elucidate the mechanisms behind CQF and deepen our understanding of data selection methods widely used in practice.
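The CQF procedure summarized in the abstract (train a binary classifier, score every pretraining document, keep the top-scoring ones) can be illustrated with a minimal sketch. This is not the authors' implementation: a smoothed naive Bayes unigram model stands in for whatever classifier is used in practice, and the function names, the whitespace tokenization, and the `keep_fraction` parameter are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_unigram_classifier(high_quality_docs, pretraining_docs):
    """Fit a naive Bayes unigram model (an illustrative stand-in classifier).

    Positive class: the small high-quality set.
    Negative class: the pretraining data.
    Returns a dict mapping each token to its log-odds of being high-quality.
    """
    hq_counts = Counter(t for d in high_quality_docs for t in d.lower().split())
    pt_counts = Counter(t for d in pretraining_docs for t in d.lower().split())
    vocab = set(hq_counts) | set(pt_counts)
    # Add-one smoothing so tokens seen in only one class keep a finite score.
    hq_total = sum(hq_counts.values()) + len(vocab)
    pt_total = sum(pt_counts.values()) + len(vocab)
    return {
        tok: math.log((hq_counts[tok] + 1) / hq_total)
        - math.log((pt_counts[tok] + 1) / pt_total)
        for tok in vocab
    }


def quality_score(doc, log_odds):
    """Quality score = mean per-token log-odds; unseen tokens contribute 0."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return sum(log_odds.get(t, 0.0) for t in tokens) / len(tokens)


def cqf_filter(pretraining_docs, high_quality_docs, keep_fraction=0.5):
    """Rank pretraining documents by classifier score and keep the top fraction."""
    log_odds = train_unigram_classifier(high_quality_docs, pretraining_docs)
    ranked = sorted(
        pretraining_docs,
        key=lambda d: quality_score(d, log_odds),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy data, invented purely for this example.
high_quality = [
    "the theorem follows from the lemma",
    "we prove the lemma by induction",
]
pretraining = [
    "buy cheap pills now",
    "click here to win now",
    "the proof of the theorem uses induction",
    "cheap deals click now",
]
selected = cqf_filter(pretraining, high_quality, keep_fraction=0.25)
```

In this sketch the selected subset is always drawn from the pretraining pool, never from the high-quality set itself; the retained fraction is the main knob, and the classifier only supplies a ranking.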
Lay Summary: Modern AI systems are trained on enormous collections of text gathered from the internet, where the quality of documents can vary widely. To improve training data, researchers often use automated filters that try to identify and keep only the “best” documents. One popular method trains a classifier to recognize examples from a small curated dataset and then selects similar documents from the larger web corpus.
In this work, we study whether this filtering method truly identifies higher-quality data. We find that although the method improves performance on some downstream tasks, it does not consistently improve the model’s ability to predict text from the supposedly high-quality source itself. We show that this happens because the filter unintentionally removes parts of the reference dataset it was designed to imitate. We also compare this filtering approach to experiments using synthetic low-quality data and find very different patterns of behavior. Overall, our results suggest that current filtering methods may not measure data quality as directly as commonly assumed, and that better ways of understanding and selecting training data are needed.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Quality Filtering, Data Quality, Pretraining, Data
Submission Number: 28644