Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

ACL ARR 2026 January Submission4955 Authors

05 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | License: CC BY 4.0

Keywords: data filtering, efficient verification strategy, high-quality LLM data, machine learning

Abstract: With the rapid development of large language models (LLMs), data quality has become a key factor in model performance, and model-driven data filtering has become a primary approach for acquiring high-quality data. However, this approach still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to obtain timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, which introduces subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training at minimal computational cost. We then build on the assumption that high-quality seed data is beneficial for LLM training and, by integrating the proposed verification strategy, optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline improves filtering efficiency, classifier quality, and robustness, and also significantly reduces experimental and inference costs. Using a lightweight fastText-based classifier within this pipeline, we process two widely used pre-training corpora (FineWeb and Chinese FineWeb) to create the higher-quality Ultra-FineWeb dataset, which contains approximately $1.8$ trillion English tokens and $120$ billion Chinese tokens. Empirical evaluations show that LLMs pre-trained on Ultra-FineWeb achieve significant performance improvements across multiple benchmarks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.
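The classifier-based filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: a toy token log-odds scorer stands in for the fastText classifier, and the function names (`train_log_odds`, `quality_score`, `filter_corpus`), the seed documents, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_log_odds(positive_docs, negative_docs):
    """Learn per-token log-odds weights from positive vs. negative seed documents.

    Toy stand-in for a trained text classifier (assumption: whitespace tokens,
    add-one smoothing). Positive weights indicate tokens typical of good seeds.
    """
    pos = Counter(tok for d in positive_docs for tok in d.lower().split())
    neg = Counter(tok for d in negative_docs for tok in d.lower().split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        tok: math.log((pos[tok] + 1) / pos_total)
        - math.log((neg[tok] + 1) / neg_total)
        for tok in vocab
    }


def quality_score(weights, doc):
    """Score a document in (0, 1): sigmoid of the mean token weight.

    Tokens unseen during training contribute 0; an empty document scores 0.0.
    """
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    mean = sum(weights.get(t, 0.0) for t in tokens) / len(tokens)
    return 1.0 / (1.0 + math.exp(-mean))


def filter_corpus(weights, docs, threshold=0.5):
    """Keep only documents whose quality score exceeds the threshold."""
    return [d for d in docs if quality_score(weights, d) > threshold]


# Hypothetical seed data: the source does not list its actual seeds.
positive_seeds = [
    "the theorem follows from the lemma",
    "we prove the result by induction",
]
negative_seeds = [
    "click here to win free prizes",
    "buy cheap pills online now",
]
weights = train_log_odds(positive_seeds, negative_seeds)

corpus = ["we prove the lemma by induction", "click now to buy free pills"]
kept = filter_corpus(weights, corpus)
print(kept)  # ['we prove the lemma by induction']
```

In the setting the abstract describes, the choice of positive and negative seeds is the part guided by the verification strategy; the sketch only shows where those seeds enter the classifier and how a score threshold turns the classifier into a filter.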

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: corpus creation, automatic creation and evaluation of language resources, NLP datasets

Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Data resources, Data analysis

Languages Studied: English, Chinese

Submission Number: 4955
