Bayelemabaga: Creating Resources for Bambara NLP

ACL ARR 2024 June Submission 5938 Authors

16 Jun 2024 (modified: 23 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In low-resource settings, the problem is often not only the amount of data available but also its quality, in ways that are entirely foreign to high-resource languages. For instance, many extremely low-resource languages have only recently acquired writing systems. This may result in multiple writing systems competing for dominance or, within a single writing system, non-standardized spelling. Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of suitable parallel data. In this case study, we focus on the impact of manual data cleaning on the performance of machine translation models. We focus on Bambara, the vehicular language of Mali, and introduce the largest curated dataset for multilingual translation. We fine-tune six commonly used transformer-based language models, namely AfriMBART, AfriMT5, AfriM2M100, Mistral, Open-Llama-7B, and Meta-Llama3-8B, on three existing Bambara-French language-pair datasets and our curated dataset. We show that our new aligned and curated multilingual dataset enhances the translation quality of all studied models as measured by the BLEU, chrF++, and AfriCOMET evaluation metrics.
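The abstract reports results with chrF++, a character-level metric that is often more forgiving than BLEU for languages with non-standardized spelling, such as Bambara. As a rough illustration of the character n-gram F-score underlying chrF (chrF++ additionally mixes in word n-grams), here is a minimal, simplified sentence-level sketch in plain Python; it is not the implementation used in practice (tools like sacreBLEU are the standard choice) and the function names are illustrative only.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with spaces removed, as in chrF.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_sentence(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: average F-beta score over
    character n-gram orders 1..max_n (beta=2 weights recall higher,
    as in the original metric)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # strings too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Because matching happens at the character level, a hypothesis with a spelling variant of a reference word still earns partial credit, which is one reason character-based metrics are popular for languages with competing orthographies.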
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Multilinguality and Diversity, Resources and Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: Bambara
Submission Number: 5938