Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: low-resource languages, African languages, LLMs for African languages, small language models
TL;DR: A machine translation system for the Ehugbo language, showing how the fairness gap in LLMs can be bridged for African languages
Abstract: Despite advances in language technologies, large language models (LLMs) continue to exclude low-resource languages, particularly African dialects like Ehugbo, a critically endangered variant of Igbo spoken by fewer than 150,000 people in Afikpo, Nigeria. Ehugbo's orthography, which uses two letters in addition to the 36 of the Igbo alphabet, deepens its marginalization, because existing models fail to account for its unique structure. This exclusion perpetuates social and linguistic inequities, leaving speakers of such dialects without access to digital tools that could preserve their language and culture. This paper presents NLP-Ehugbo, a machine translation (MT) system designed to address this fairness gap. Using the only available parallel corpus, 1,021 Ehugbo-English sentences from the New Testament of the Bible, we evaluated and fine-tuned two state-of-the-art models, M2M100 (facebook/m2m100_418M) and NLLB (facebook/nllb-200-distilled-600M). Before fine-tuning, both models performed poorly: M2M100 achieved a BLEU score of 1.2188, while NLLB scored only 0.0262. After fine-tuning, M2M100 improved to 16.1719 and NLLB reached 20.4016, demonstrating the potential of adapting LLMs to low-resource languages. Our findings reveal both promise and challenges. While fine-tuning significantly improves performance, the lack of diverse datasets limits translation quality and reinforces the need for inclusive data collection practices. This work highlights the importance of community-driven approaches, as linguistic preservation cannot be achieved without the active involvement of native speakers. The significance of NLP-Ehugbo lies in its contribution to the fairness discourse in LLMs. By focusing on Ehugbo, we expose the systemic bias that excludes low-resource dialects and advocate for a more equitable approach to language technologies. This project not only advances the field of low-resource MT but also serves as a call to action for researchers and developers to prioritize linguistic diversity, ensuring that no language is left behind in the digital age.
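The BLEU scores quoted in the abstract are on a 0 to 100 scale. The abstract does not say which BLEU implementation, tokenization, or smoothing was used, so the following is only a minimal sketch of corpus-level BLEU under stated assumptions: whitespace tokenization, one reference per sentence, n-grams up to order 4, and no smoothing. The function names and the example sentences are hypothetical and are not taken from the paper's corpus; scores from this sketch would not necessarily match those from a standard tool.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale (illustrative sketch).

    Assumes one reference per hypothesis and whitespace tokenization.
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


# Hypothetical English outputs and references, for illustration only.
example_hyps = ["in the beginning was the word"]
example_refs = ["in the beginning was the word"]
example_score = corpus_bleu(example_hyps, example_refs)
```

Because the sketch applies no smoothing, a system that never produces a correct 4-gram scores exactly zero, which is one reason very low scores on a small test set should be read with caution.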
Submission Number: 45