Speech Recognition Datasets for Low-resource Congolese Languages

Published: 03 Mar 2023, Last Modified: 15 Apr 2023, AfricaNLP 2023, Readers: Everyone
Keywords: Africa, dataset collection, multilingual speech recognition, low-resource languages, self-supervised learning
TL;DR: Dataset collection for low-resource African languages
Abstract: Large pre-trained Automatic Speech Recognition (ASR) models have begun to perform better on low-resource languages as a result of the availability of data and transfer learning. However, only a small number of languages have sufficient resources to benefit from transfer learning. This paper contributes to expanding speech recognition resources for under-represented languages. We release two new datasets to the research community: the Lingala Read Speech Corpus, consisting of 4 hours of labeled audio clips, and the Congolese Speech Radio Corpus, containing 741 hours of unlabeled audio in 4 major languages spoken in the Democratic Republic of the Congo. Additionally, we obtain state-of-the-art results for Congolese wav2vec2. We observe an average decrease of 2% in word error rate (WER) when a Congolese multilingual pre-trained model is used for fine-tuning on Lingala. Importantly, our study is the first attempt towards benchmarking speech recognition systems for Lingala and presents the first-ever multilingual model for 4 Congolese languages spoken by a combined 65 million people. Our data and models will be publicly available, and we hope they help advance research in ASR for low-resource languages.
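For readers unfamiliar with the metric reported above, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` and the placeholder transcripts are illustrative assumptions, not taken from the paper, and the abstract does not state whether the 2% figure is an absolute or a relative reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for real transcripts.
one_substitution = word_error_rate("a b c d", "a x c d")
one_deletion = word_error_rate("a b c d", "a b c")
```

In practice, corpus-level WER is usually obtained by summing the edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance scores.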