The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Hugo Laurençon; Lucile Saulnier; Thomas Wang; Christopher Akiki; Albert Villanova del Moral; Teven Le Scao; Leandro Von Werra; Chenghao Mou; Eduardo González Ponferrada; Huu Nguyen; Jörg Frohberg; Mario Šaško; Quentin Lhoest; Angelina McMillan-Major; Gérard Dupont; Stella Biderman; Anna Rogers; Loubna Ben allal; Francesco De Toni; Giada Pistilli; Olivier Nguyen; Somaieh Nikpoor; Maraim Masoud; Pierre Colombo; Javier de la Rosa; Paulo Villegas; Tristan Thrush; Shayne Longpre; Sebastian Nagel; Leon Weber; Manuel Romero Muñoz; Jian Zhu; Daniel Van Strien; Zaid Alyafeai; Khalid Almubarak; Vu Minh Chien; Itziar Gonzalez-Dios; Aitor Soroa; Kyle Lo; Manan Dey; Pedro Ortiz Suarez; Aaron Gokaslan; Shamik Bose; David Ifeoluwa Adelani; Long Phan; Hieu Tran; Ian Yu; Suhas Pai; Jenny Chim; Violette Lepercq; Suzana Ilic; Margaret Mitchell; Sasha Luccioni; Yacine Jernite

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Published: 17 Sept 2022, Last Modified: 20 Apr 2025NeurIPS 2022 Datasets and Benchmarks Readers: Everyone

Keywords: BigScience, Dataset, Multilingual, Language Modeling

TL;DR: 1.6TB multilingual dataset created collaboratively within BigScience to train language models

Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

Supplementary Material: pdf

URL: https://hf.co/bigscience-data

Dataset Url: Data: https://hf.co/bigscience-data Tooling: https://github.com/bigscience-workshop/data-preparation

License: - Each constituent subset of the dataset will be released under the license that applies to it. (See individual dataset page for specific license information: https://hf.co/bigscience-data) - Tooling code released under Apache 2.0

Author Statement: Yes

Contribution Process Agreement: Yes

In Person Attendance: Yes

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 6 code implementations](https://www.catalyzex.com/paper/the-bigscience-roots-corpus-a-1-6tb-composite/code)

17 Replies

Loading