Keywords: Large Language Models, Data Selection, Pre-Training
Abstract: As the performance of large language models (LLMs) emerges via data scaling, the significance of pre-training data becomes increasingly evident. Although methods such as deduplication and high-quality sampling have explored data selection, comprehensive criteria for text quality remain underdeveloped, hindering efficient pre-training data selection and composition.
This paper establishes guidelines for data selection, fosters consensus on data quality, and introduces a management tool to evaluate data quality and domain types.
We believe that robust quality criteria should apply across diverse texts, demonstrate semantic understanding of content, and complement one another.
Previous work mainly relies on intuition and lacks generalizability. To tackle this, we employ reverse thinking, \emph{prompting LLMs to self-identify the causes of anomalous perplexity (PPL)} in text, and derive 13 quality criteria related to LLM performance, together with a comprehensive metric called the \emph{Overall Score}.
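As a reminder of the quantity involved, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from per-token log-probabilities and flags documents whose PPL falls outside a band. The function names, the toy log-probabilities, and the `low`/`high` thresholds are all assumptions for illustration; the abstract does not state how "anomalous" is defined.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def flag_anomalous(ppl_by_doc, low, high):
    """Return sorted ids of documents whose PPL lies outside [low, high].

    The thresholds are hypothetical; the source does not specify them.
    """
    return sorted(
        doc_id for doc_id, ppl in ppl_by_doc.items() if ppl < low or ppl > high
    )

# Illustrative natural-log probabilities, not real model output.
docs = {
    "fluent": [math.log(0.5)] * 4,
    "noisy": [math.log(0.01)] * 4,
}
ppl = {doc_id: perplexity(logprobs) for doc_id, logprobs in docs.items()}
anomalous = flag_anomalous(ppl, low=1.5, high=50.0)
```

In a real pipeline the log-probabilities would come from a language model's scoring pass, and the flagged documents would then be shown to an LLM with a prompt asking why their PPL is unusual.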
We developed a complete prompt that integrates quality criteria and domain types.
We adopt pointwise LLM ratings after comparing the computational complexity of pointwise and pairwise rating (\(O(N)\) vs. \(O(N^{2})\)); pointwise rating is more feasible for very large datasets and agrees with human assessments in over 95\% of cases.
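The complexity gap can be made concrete by counting rating calls. This is a minimal sketch assuming one LLM call per document for pointwise rating and one call per unordered document pair for exhaustive pairwise rating; the function names are hypothetical, and the value of `n_docs` simply reuses the 356K figure from the abstract as an illustrative N.

```python
def pointwise_calls(n_docs):
    """One rating call per document: O(N)."""
    return n_docs

def pairwise_calls(n_docs):
    """One comparison per unordered pair of documents: N(N-1)/2, i.e. O(N^2)."""
    return n_docs * (n_docs - 1) // 2

n_docs = 356_000  # illustrative N, borrowed from the annotated-document count
ratio = pairwise_calls(n_docs) / pointwise_calls(n_docs)
print(f"pointwise: {pointwise_calls(n_docs):,} calls")
print(f"pairwise:  {pairwise_calls(n_docs):,} calls ({ratio:,.1f}x more)")
```

Practical pairwise schemes often sample pairs rather than comparing all of them, which lowers the cost but still requires more than one call per document.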
By annotating 356K documents using GPT-4-turbo and fine-tuning a Qwen2-1.5B model, we created the \textbf{Data} \textbf{Man}ager (\textbf{DataMan}), whose fine-tuning accuracy averages close to 80\% across all criteria and reaches 81.6\% for \emph{Overall Score}.
We annotated 447B tokens from the SlimPajama corpus with DataMan and selected a 30B-token subset that maximizes quality representativeness while ensuring domain diversity, which we used to train a 1.3B-parameter LLM.
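One simple way to trade off quality against domain diversity under a token budget is a greedy round-robin over domains that always takes the best-rated remaining document. The sketch below is an assumption for illustration and is not claimed to be the authors' sampling procedure; the record keys (`id`, `domain`, `score`, `tokens`) and the function name are hypothetical.

```python
from collections import defaultdict

def select_subset(docs, token_budget):
    """Greedy round-robin over domains, best-rated documents first.

    docs: list of dicts with hypothetical keys "id", "domain", "score", "tokens".
    Returns (selected ids in pick order, total tokens used).
    """
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)
    for pool in by_domain.values():
        # Ascending sort so pop() removes the highest-scored document.
        pool.sort(key=lambda d: d["score"])

    selected, used = [], 0
    domains = sorted(by_domain)
    while any(by_domain[d] for d in domains):
        for domain in domains:
            pool = by_domain[domain]
            if not pool:
                continue
            doc = pool.pop()  # best remaining document in this domain
            if used + doc["tokens"] <= token_budget:
                selected.append(doc["id"])
                used += doc["tokens"]
            # Documents that do not fit the remaining budget are skipped.
    return selected, used

example_docs = [
    {"id": "a1", "domain": "code", "score": 5, "tokens": 100},
    {"id": "a2", "domain": "code", "score": 3, "tokens": 100},
    {"id": "b1", "domain": "law", "score": 4, "tokens": 100},
    {"id": "b2", "domain": "law", "score": 2, "tokens": 100},
]
selected_ids, used_tokens = select_subset(example_docs, token_budget=300)
```

Visiting domains in turn keeps any single domain from exhausting the budget, while the per-domain ordering favors higher-rated text.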
Results show that models trained on DataMan-sampled data outperform state-of-the-art baselines by 0.4\% to 4.3\% in in-context learning (ICL) and by 34.2\% to 57\% in instruction-following win rate.
The strongest model, \emph{Overall Score l=5}, significantly surpasses models trained on uniformly sampled data even when those models use 50\% more data.
Continued pre-training on high-rated domain-specific data further boosts ICL performance, validating DataMan's effectiveness in domain mixing.
We reveal that PPL and ICL results do not strictly align, underscoring the distinction between understanding and generalization abilities.
Our contributions include: i) developing a data quality criteria system based on LLM PPL features; ii) creating DataMan for data quality rating and domain identification; and iii) releasing our code, models, and annotated datasets to facilitate research on the relationship between data and LLMs.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12977