Keywords: data curation, legal data, content filtering, ai and law
TL;DR: We examine how the law and legal data can inform data filtering practices, and we provide an extensive ~256GB legal dataset (the Pile of Law) that can be used both to learn these norms and for pretraining.
Abstract: One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a ~256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that have the promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons for researchers and discuss how our dataset reflects these norms. Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data, providing an exciting new research direction in model-based processing.
Author Statement: Yes
Dataset Url: https://huggingface.co/datasets/pile-of-law/pile-of-law
License: The Pile of Law is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, but some data subsets carry different open licenses. See the paper appendices for a detailed subset-by-subset breakdown of licenses.
Supplementary Material: pdf
Contribution Process Agreement: Yes
In Person Attendance: No