Keywords: data filtering, efficient verification strategy, high-quality LLM data, machine learning
Abstract: With the rapid development of large language models (LLMs), data quality has become a key factor in model performance. Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training at minimal computational cost. To address the second, we build on the assumption that high-quality seed data is beneficial for LLM training: by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. Using a lightweight fastText-based classifier within this pipeline, we process two widely used pre-training corpora (FineWeb and Chinese FineWeb), yielding the higher-quality Ultra-FineWeb dataset with approximately $1.8$ trillion English and $120$ billion Chinese tokens. Empirical evaluations demonstrate that LLMs pre-trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmarks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, automatic creation and evaluation of language resources, NLP datasets
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 4955