Abstract: The quality of pre-training corpora is central to the capabilities of large language models (LLMs), yet current curation pipelines that rely on rule-based filters or small supervised models lack scalability and adaptability. This work introduces DataCurBench, a comprehensive benchmark for evaluating the ability of LLMs to autonomously perform two sequential pre-training data curation tasks: data filtering, which selects high-quality training data, and data cleaning, which improves linguistic form and coherence to enhance training effectiveness. We propose a systematic evaluation framework and present empirical findings that reveal a dual pattern in LLM performance. While LLMs demonstrate near-human proficiency in language-driven data cleaning, they remain limited in data filtering, often failing to apply prompt-based selection criteria consistently and underperforming fine-tuned smaller models. DataCurBench is publicly available\footnote{\url{https://huggingface.co/datasets/anonymousaiauthor/DataCurBench}}, offering a practical benchmark to evaluate data curation, highlight key challenges, and support the development of more efficient and ethical pre-training pipelines.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: resources and evaluation: benchmarking, language modeling: pre-training, interpretability and analysis of models for nlp: robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English, Chinese
Keywords: large language models; pre-training data curation; data filtering; data cleaning; benchmark evaluation
Submission Number: 1284