An Archival Perspective on Pretraining Data

Published: 23 Oct 2023, Last Modified: 28 Nov 2023 · SoLaR Spotlight
Keywords: Pretraining data, measurement, archives
TL;DR: Pretraining data design and use is political; an archival perspective reveals how to study not just the data, but the systems that produce it.
Abstract: Research in NLP on pretraining data has largely focused on identifying and mitigating downstream risks in models. We argue that more critical attention is needed to pretraining datasets and the systems that produce them. To highlight the broader range of impacts of pretraining corpora, we draw an analogy between pretraining datasets and archives. Within the broader ecosystem of datasets and models, we focus especially on the processes involved in creating pretraining data. By adopting an archival perspective, we surface impacts beyond directly shaping model behavior, including the role of pretraining corpora as independent data artifacts and the ways that their collection shapes future practices. In particular, we explore research in NLP that parallels the archival practice of appraisal: we consider how pretraining data is filtered and critically examine the problem formulations this work adopts. In doing so, we underscore that choices about what is included in pretraining data are necessarily subjective decisions about values. We conclude by drawing on archival studies to offer insights on paths forward.
Submission Number: 88