Koumankan: A Scalable And Cost Efficient Way To Extend Common Voice For Dyula And Other African Language

07 Jul 2023 (modified: 07 Dec 2023) · DeepLearningIndaba 2023 Conference Submission
Keywords: NLP, African Language, Transformer, ASR, AST, MT
TL;DR: This paper presents the current progress of the Koumankan project, which aims to build an audio database in Dyula and other African languages.
Abstract: The field of automatic processing of African languages has witnessed significant progress in recent years, thanks to the efforts of researchers and communities such as Masakhane. This progress has resulted in a range of machine learning models, including machine translation, automatic speech recognition, and named entity recognition models. These tools have the potential to facilitate technological, social, and financial inclusion by eliminating language barriers. However, research cannot move forward without rigorously collected, inclusive data. We therefore turned our attention to data collection for low-resource languages. In this paper, we present the Koumankan project, which proposes a scalable and cost-efficient method of extending the Common Voice dataset for Dyula and other African languages. We discuss our approach and report on the current state of construction of this dataset. The project aims to improve the quality and quantity of speech data available for African languages and to promote the development of speech recognition and machine translation models for these languages. The scalability and cost-effectiveness of our approach make it suitable for gathering large amounts of speech data in a relatively short period. Upon completion of the project, all data will be made publicly available.
Submission Category: Machine learning algorithms
Submission Number: 16