VoxMg: An Automatic Speech Recognition Dataset for Malagasy

Published: 03 Mar 2023, Last Modified: 15 Apr 2023, Venue: AfricaNLP 2023, Readers: Everyone
Keywords: Automatic Speech Recognition, Data Collection, Low-Resource Languages, Malagasy
Abstract: African languages are not well represented in Natural Language Processing (NLP), mainly because of a lack of resources for training models. Low-resource languages such as Malagasy cannot benefit from modern NLP methods when no datasets are available. This paper presents the curation and annotation of VoxMg, a speech dataset for Malagasy that consists of 3873 audio files totaling 10.80 hours. We also train a baseline model, the first Automatic Speech Recognition (ASR) model ever built for this language, which obtains a Word Error Rate (WER) of 33%.
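For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's hypothesis, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 33% corresponds to roughly one word error for every three reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.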