AfriVox: Probing Multilingual and Accent Robustness of Speech LLMs

ACL ARR 2025 May Submission7103 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Recent advances in multimodal large language models (LLMs) have enabled impressive speech recognition and translation capabilities, yet these models remain poorly evaluated in low-resource settings, particularly for African languages and non-native English accents. In this work, we systematically compare state-of-the-art speech-based LLMs with traditional Automatic Speech Recognition (ASR) systems on transcription and translation tasks involving dialectally diverse African speech. To support reproducible evaluation, we introduce AfriVox, a novel open-source benchmark comprising medical and non-medical speech samples spanning 20 African languages and 100+ African English accents. Our findings reveal substantial performance disparities, underscoring the limitations of current LLMs in handling underrepresented linguistic varieties. To address this, we fine-tune the newly released Qwen-2.5-Omni for multilingual transcription and translation using NaijaVoices, a 1,800-hour Nigerian speech corpus. Parameter-efficient instruction tuning with LoRA yields a 54% reduction in Word Error Rate (WER) and a 21% average improvement in BLEU scores over baseline models. Our results demonstrate that multimodal LLMs can be effectively adapted for low-resource speech tasks using lightweight techniques. This work provides a foundation for scalable speech technology development in underrepresented languages and informs future research in inclusive multimodal learning.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Resources and Evaluation, Language Modeling, Speech Recognition, Text-to-Speech and Spoken Language Understanding
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Afrikaans, Akan, Amharic, Arabic, English, French, Ga, Hausa, Igbo, Kinyarwanda, Luganda, Pedi, Sesotho, Shona, Swahili, Tswana, Twi, Xhosa, Yoruba, and Zulu
Submission Number: 7103
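
For readers unfamiliar with the WER metric cited in the abstract: it is conventionally computed as the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the authors' evaluation code. The function names, the example sentences, and the WER values passed to `relative_reduction` are hypothetical, and it is an assumption that a "reduction in WER" is measured relative to the baseline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion over 5 words.
    print(word_error_rate("the patient has a fever", "the patient have fever"))
    # Hypothetical WER values, not figures from the paper.
    print(round(relative_reduction(0.40, 0.30), 2))
```

Production evaluations typically normalize text (casing, punctuation, numerals) before scoring, which this sketch omits.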