The African Languages Lab: A Global Low-Resource Language Collaborative Approach to Advancing NLP for African Languages
Abstract: The digital revolution has left behind hundreds of millions of speakers of low-resource languages (LRLs), particularly in Africa, creating a critical gap in global information access and technological representation. We introduce the African Languages Lab (All Lab), a systematic approach to advancing NLP capabilities for African LRLs through a coordinated research framework. Our work contributes (1) a quality-controlled data collection pipeline that yields the largest validated multi-modal speech and text dataset for African LRLs, spanning 40 languages and comprising 500 GB of combined parallel text and 4,000 hours of aligned speech data, and (2) experiments demonstrating that fine-tuning on this dataset with QLoRA improves over the base model across multiple metrics (up to +49.8 chrF++, +60.2 BLEU, and +0.28 COMET points). Together, these establish a sustainable framework for expanding NLP capabilities to historically underserved languages while fostering local research capacity through structured mentorship and collaboration. We will release our data for research.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings, multilingualism, multilingual benchmarks, multilingual evaluation, dialects and language varieties, less-resourced languages, endangered languages
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Afrikaans, Amharic, Arabic, Bambara, Bemba, Berber, Chewa, Ewe, Fang, Fon, Fula, Hausa, Igbo, Kanuri, Kikongo, Kikuyu, Kiluba, Kinyarwanda, Krio, Lingala, Luganda, Malagasy, Mandinka, Mossi, Ngambay, Oromo, Rundi, Sesotho, Shona, Somali, Swahili, Tigrinya, Tshiluba, Tswana, Twi, Umbundu, Wolof, Xhosa, Yoruba, Zulu
Submission Number: 5168