Bayelemabaga: Creating Resources for Bambara NLP

ACL ARR 2024 June Submission 5938 Authors

16 Jun 2024 (modified: 23 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In low-resource settings, the problem is often not only the amount of data available but also its quality, in ways that are entirely foreign to high-resource languages. For instance, many extremely low-resource languages have only recently acquired writing systems. This may result in multiple writing systems competing for dominance or, within a single writing system, non-standardized spelling. Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of suitable parallel data. In this case study, we focus on the impact of manual data cleaning on the performance of machine translation models. We focus on Bambara, the vehicular language of Mali, and introduce the largest curated dataset for multilingual translation. We fine-tune six commonly used transformer-based language models, namely AfriMBART, AfriMT5, AfriM2M100, Mistral, Open-Llama-7B, and Meta-Llama3-8B, on three existing Bambara-French language-pair datasets and our curated dataset. We show that our new aligned and curated multilingual dataset enhances the translation quality of all studied models as measured by the BLEU, chrF++, and AfriCOMET evaluation metrics.
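The abstract reports results with chrF++, a character-level metric that is often more forgiving than BLEU for languages with non-standardized spelling, such as Bambara. As a rough illustration of the character n-gram F-score underlying chrF (chrF++ additionally mixes in word n-grams), here is a minimal, simplified sentence-level sketch in plain Python; it is not the implementation used in practice (tools like sacreBLEU are the standard choice) and the function names are illustrative only.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with spaces removed, as in chrF.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_sentence(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: average F-beta score over
    character n-gram orders 1..max_n (beta=2 weights recall higher,
    as in the original metric)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # strings too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Because matching happens at the character level, a hypothesis with a spelling variant of a reference word still earns partial credit, which is one reason character-based metrics are popular for languages with competing orthographies.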
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Multilinguality and Diversity, Resources and Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: Bambara
Submission Number: 5938