TL;DR: We construct novel domains for unstructured web data and demonstrate how they lead to better data curation
Abstract: Modern language models are trained on large, unstructured datasets consisting of trillions of tokens obtained by crawling the web. Their unstructured nature makes it difficult to reason about their contents and to develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve downstream task performance, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
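To make the two-step idea in the abstract concrete, here is a minimal sketch (not the authors' released code; see the linked repository for the actual framework) of how one might (1) annotate each web page with a topic and a format label using an efficient distilled classifier and (2) resample the corpus toward a target domain mixture. The model names, label set, and mixture weights are hypothetical placeholders.

```python
# Illustrative sketch of domain annotation + domain mixing.
# Classifier names below are placeholders, not the official WebOrganizer models.

import random
from transformers import pipeline

# Hypothetical distilled classifiers that predict topic and format labels.
topic_clf = pipeline("text-classification", model="my-org/topic-classifier")
format_clf = pipeline("text-classification", model="my-org/format-classifier")


def annotate(docs):
    """Attach a (topic, format) domain label to every document."""
    annotated = []
    for text in docs:
        snippet = text[:2048]  # truncate long pages before classification
        topic = topic_clf(snippet)[0]["label"]
        fmt = format_clf(snippet)[0]["label"]
        annotated.append({"text": text, "domain": (topic, fmt)})
    return annotated


def resample(annotated, target_mix, n_samples, seed=0):
    """Sample documents so that domains appear with the target proportions."""
    rng = random.Random(seed)
    by_domain = {}
    for doc in annotated:
        by_domain.setdefault(doc["domain"], []).append(doc)
    sampled = []
    for domain, weight in target_mix.items():
        pool = by_domain.get(domain, [])
        if pool:
            k = int(round(weight * n_samples))
            sampled.extend(rng.choices(pool, k=k))  # sample with replacement
    rng.shuffle(sampled)
    return sampled


# Example: upweight a (hypothetical) science/tutorial domain in the mixture.
# target_mix = {("science", "tutorial"): 0.3, ("news", "article"): 0.7}
# train_docs = resample(annotate(raw_docs), target_mix, n_samples=100_000)
```

The key design choice this sketch reflects is that topic and format are treated as two independent axes, so a single page contributes to a joint (topic, format) domain whose sampling weight can be tuned separately.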
Lay Summary: A central problem when curating training data for language models is how to combine data from different sources. Increasingly, the majority of training data consists of web pages, but most prior work treats the web as a single data source. We study the internal composition of web data by breaking it down into meaningful topic and format categories. Based on these topic and format annotations, we provide insights into existing data curation practices and also improve them by changing the balance of topics and formats in the training data.
Link To Code: https://github.com/CodeCreator/WebOrganizer
Primary Area: Deep Learning->Large Language Models
Keywords: pre-training, data-centric, data curation, language models
Submission Number: 12137