Keywords: data pruning, language models, perplexity, transformers, llms
TL;DR: This paper investigates the quality of large language model training data, comparing simple and complex methods to rank and prune noisy web text datasets.
Abstract: Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. To date, efforts to prune these datasets into higher-quality subsets have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data, namely perplexity, the Error L2-Norm, and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on the pruned datasets. We find that perplexity outperforms the other scoring methods and improves over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets a foundation for strategies to automatically curate high-quality corpora and suggests that large amounts of pretraining data can be removed while retaining performance.
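A minimal sketch of the rank-and-prune step described in the abstract, using perplexity as the score. It assumes per-token natural-log probabilities have already been obtained from some reference language model. The function names, the `keep` parameter, and the choice of which part of the ranking to retain are hypothetical, since the abstract does not specify them.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one document from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prune_by_perplexity(docs, keep_fraction=0.3, keep="low"):
    """Rank documents by perplexity and keep a contiguous slice of the ranking.

    docs: list of (doc_id, token_logprobs) pairs.
    keep_fraction: share of documents to retain (0.3 mirrors the 30% figure).
    keep: "low", "middle", or "high" part of the ranking. This selector is an
          assumption made for illustration, not a detail from the abstract.
    """
    ranked = sorted(docs, key=lambda d: perplexity(d[1]))
    n_keep = max(1, round(keep_fraction * len(ranked)))
    if keep == "low":
        start = 0
    elif keep == "middle":
        start = (len(ranked) - n_keep) // 2
    elif keep == "high":
        start = len(ranked) - n_keep
    else:
        raise ValueError("keep must be 'low', 'middle', or 'high'")
    return [doc_id for doc_id, _ in ranked[start:start + n_keep]]
```

Exposing the retained slice as a parameter keeps the sketch neutral about which region of the perplexity ranking is most useful, a question the abstract leaves to the paper's experiments.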
Submission Number: 49