The People’s Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

Daniel Galvez; Greg Diamos; Juan Manuel Ciro Torres; Juan Felipe Cerón; Keith Achorn; Anjali Gopi; David Kanter; Max Lam; Mark Mazumder; Vijay Janapa Reddi

Published: 29 Jul 2021, Last Modified: 24 May 2023

NeurIPS 2021 Datasets and Benchmarks Track (Round 1)

Readers: Everyone

Keywords: speech recognition, dataset, forced alignment, creative commons, supervised learning

TL;DR: We introduce a large, diverse English speech recognition dataset under a CC-BY-SA license.

Abstract: The People’s Speech is a free-to-download, 31,400-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA. The data is collected by searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 32.17% word error rate on LibriSpeech’s test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus and our plans for continued maintenance of the project under MLCommons’s sponsorship.

Supplementary Material: zip

URL: https://mlcommons.org/en/peoples-speech/

Contribution Process Agreement: Yes

Dataset Url: https://mlcommons.org/en/peoples-speech/

License: CC-BY-SA, with a CC-BY subset

Author Statement: Yes
