Keywords: Amharic ASR, noise robustness, data augmentation, end-to-end models, low-resource languages
Abstract: Automatic Speech Recognition for low-resource languages such as Amharic faces challenges due to limited high-quality data and background noise. This study examines how different types of training data, including clean recordings, noisy recordings, and synthetically augmented data, affect the performance and robustness of an Amharic speech recognition system. The experiments use 155 hours of speech, comprising 110 hours of clean data from the Andreas Nürnberger Data and Knowledge Engineering Group and 45 hours of real-world noisy recordings. Additional synthetic data were created using noise injection, speed perturbation, and SpecAugment, resulting in a total of 575 hours of speech data. A convolutional neural network with bidirectional gated recurrent units and Connectionist Temporal Classification was trained under four conditions: clean data, noisy data, combined data, and augmented data. The results show that models trained on the combined and augmented datasets outperform models trained on a single dataset, achieving a word error rate of 5.49 percent under mixed conditions, a relative improvement of 21.5 percent. These findings highlight the importance of data diversity and augmentation in developing robust speech recognition systems for low-resource languages. Future work will explore the use of visual information, such as lip movements, to further improve recognition accuracy in challenging environments.
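The abstract names three augmentation techniques (noise injection, speed perturbation, SpecAugment) but the submission page carries no code, so the following is only a minimal sketch of how such a pipeline is commonly assembled with torchaudio. All function names, SNR ranges, speed factors, and mask sizes here are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical augmentation sketch; parameter values are assumptions, not the paper's settings.
import random
import torch
import torchaudio

def add_noise(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix a noise clip into a clean waveform at a chosen signal-to-noise ratio."""
    # Tile or trim the noise so it matches the length of the clean signal.
    if noise.shape[-1] < clean.shape[-1]:
        noise = noise.repeat(1, clean.shape[-1] // noise.shape[-1] + 1)
    noise = noise[..., : clean.shape[-1]]
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # Scale the noise so the mixture reaches the requested SNR in dB.
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def speed_perturb(waveform: torch.Tensor, sample_rate: int, factor: float) -> torch.Tensor:
    """Change the speaking rate with a sox-style 'speed' effect, keeping the sample rate."""
    augmented, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate,
        [["speed", str(factor)], ["rate", str(sample_rate)]],
    )
    return augmented

# SpecAugment operates on the log-mel spectrogram rather than the raw waveform.
spec_augment = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=27),  # mask up to 27 mel bins
    torchaudio.transforms.TimeMasking(time_mask_param=100),      # mask up to 100 frames
)

def augment_example(waveform, sample_rate, noise_clips):
    """Apply noise injection and speed perturbation, then SpecAugment on the features."""
    waveform = add_noise(waveform, random.choice(noise_clips), snr_db=random.uniform(5, 20))
    waveform = speed_perturb(waveform, sample_rate, factor=random.choice([0.9, 1.0, 1.1]))
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
    return spec_augment(mel.clamp(min=1e-10).log())
```

Similarly, a compact sketch of the kind of CNN + bidirectional-GRU acoustic model with a CTC output layer described in the abstract might look as follows; layer counts and sizes are again assumptions.

```python
import torch.nn as nn

class CNNBiGRUCTC(nn.Module):
    """Illustrative CNN + BiGRU acoustic model producing per-frame log-probs for nn.CTCLoss."""
    def __init__(self, n_mels: int = 80, hidden: int = 512, vocab_size: int = 300):
        super().__init__()
        # 2-D convolutions downsample the mel spectrogram in both frequency and time.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * (n_mels // 4), hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank symbol

    def forward(self, mel):                       # mel: (batch, n_mels, time)
        x = self.conv(mel.unsqueeze(1))           # (batch, 32, n_mels/4, time/4)
        x = x.permute(0, 3, 1, 2).flatten(2)      # (batch, time/4, features)
        x, _ = self.rnn(x)
        # Transpose to (time, batch, classes) before passing to nn.CTCLoss.
        return self.out(x).log_softmax(-1)
```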
Submission Number: 35