Keywords: BigScience, Dataset, Multilingual, Language Modeling
TL;DR: 1.6TB multilingual dataset created collaboratively within BigScience to train language models
Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
Supplementary Material: pdf
URL: https://hf.co/bigscience-data
Dataset Url: https://hf.co/bigscience-data
Tooling: https://github.com/bigscience-workshop/data-preparation
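For readers who want to work with the released subsets directly, here is a minimal sketch of pulling one from the Hub with the `datasets` library. The subset identifier `roots_en_wikipedia` and the `text` field name are illustrative assumptions (browse https://hf.co/bigscience-data for the actual subset list), and some subsets are gated, so authentication may be required:

```python
# Minimal sketch of loading one ROOTS subset from the Hugging Face Hub.
# NOTE: "bigscience-data/roots_en_wikipedia" is an assumed example name;
# see https://hf.co/bigscience-data for the real subset identifiers.
# Gated subsets require prior authentication (`huggingface-cli login`).
from datasets import load_dataset

subset = load_dataset("bigscience-data/roots_en_wikipedia", split="train")

# The "text" field is an assumption; inspect subset.features to confirm
# the schema of the specific subset you loaded.
print(subset[0]["text"])
```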
License: - Each constituent subset of the dataset will be released under the license that applies to it (see the individual dataset pages for specific license information: https://hf.co/bigscience-data)
- Tooling code released under Apache 2.0
Author Statement: Yes
Contribution Process Agreement: Yes
In Person Attendance: Yes
Community Implementations: [6 code implementations](https://www.catalyzex.com/paper/the-bigscience-roots-corpus-a-1-6tb-composite/code)