AfriVox: Probing Multilingual and Accent robustness of Speech LLMs

ACL ARR 2025 May Submission7103 Authors

20 May 2025 (modified: 29 Jul 2025) | ACL ARR 2025 May Submission | CC BY 4.0

Abstract: Recent advances in multimodal large language models (LLMs) have enabled impressive speech recognition and translation capabilities, yet these models remain poorly evaluated in low-resource settings, particularly for African languages and non-native English accents. In this work, we systematically compare state-of-the-art speech-based LLMs with traditional Automatic Speech Recognition (ASR) systems across transcription and translation tasks involving dialectally diverse African speech. To support reproducible evaluation, we introduce AfriVox, a novel open-source benchmark comprising medical and non-medical speech samples spanning 20 African languages and 100+ African English accents. Our findings reveal substantial performance disparities, underscoring the limitations of current LLMs in handling underrepresented linguistic varieties. To address this, we fine-tune the newly released Qwen-2.5-Omni for multilingual transcription and translation using NaijaVoices, a 1,800-hour Nigerian speech corpus. Fine-tuning via instruction-tuned, LoRA-based parameter-efficient methods yields a 54% reduction in Word Error Rate (WER) and a 21% average improvement in BLEU scores over baseline models. Our results demonstrate that multimodal LLMs can be effectively adapted for low-resource speech tasks using lightweight techniques. This work provides a foundation for scalable speech technology development in underrepresented languages and informs future research in inclusive multimodal learning.
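For readers unfamiliar with the headline metric, the following is a minimal sketch of how Word Error Rate and a relative WER reduction are commonly computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, the whitespace tokenization is a simplifying assumption, and the abstract does not say whether the 54% figure is a relative or absolute reduction (the sketch assumes relative). The example numbers in the usage note are illustrative only and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes plain whitespace tokenization (no text normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, finetuned: float) -> float:
    """Fraction by which the fine-tuned score is lower than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - finetuned) / baseline
```

Under these assumptions, a hypothetical baseline WER of 0.50 falling to 0.23 after fine-tuning would give `relative_reduction(0.50, 0.23)` of about 0.54, i.e. a 54% relative reduction.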

Paper Type: Long

Research Area: Multilingualism and Cross-Lingual NLP

Research Area Keywords: Multilingualism and Cross-Lingual NLP, Resources and Evaluation, Language Modeling, Speech Recognition, Text-to-Speech and Spoken Language Understanding

Contribution Types: Approaches to low-resource settings, Data resources

Languages Studied: Afrikaans, Akan, Amharic, Arabic, English, French, Ga, Hausa, Igbo, Kinyarwanda, Luganda, Pedi, Sesotho, Shona, Swahili, Tswana, Twi, Xhosa, Yoruba, and Zulu

Submission Number: 7103
