Abstract: The quality of pre-training corpora is central to the capabilities of large language models (LLMs), yet current curation pipelines that rely on rule-based filters or small supervised models lack scalability and adaptability. This work introduces DataCurBench, a comprehensive benchmark for evaluating the ability of LLMs to autonomously perform two sequential pre-training data curation tasks: data filtering, which selects high-quality training data, and data cleaning, which improves linguistic form and coherence to enhance training effectiveness. We propose a systematic evaluation framework and present empirical findings that reveal a dual pattern in LLM performance. While LLMs demonstrate near-human proficiency in language-driven data cleaning, they remain limited in data filtering, often failing to apply prompt-based selection criteria consistently and underperforming fine-tuned smaller models. DataCurBench is publicly available\footnote{\url{https://huggingface.co/datasets/anonymousaiauthor/DataCurBench}}, offering a practical benchmark to evaluate data curation, highlight key challenges, and support the development of more efficient and ethical pre-training pipelines.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: resources and evaluation: benchmarking, language modeling: pre-training, interpretability and analysis of models for nlp: robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English, Chinese
Keywords: large language models; pre-training data curation; data filtering; data cleaning; benchmark evaluation
Submission Number: 1284