Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks

Published: 01 Jan 2024, Last Modified: 20 May 2025, Venue: EMNLP 2024, License: CC BY-SA 4.0
Abstract: This paper introduces a comprehensive collection of NLP resources for Emakhuwa, Mozambique’s most widely spoken language. The resources include the first manually translated news bitext corpus between Portuguese and Emakhuwa, news topic classification datasets, and monolingual data. We detail the process and challenges of acquiring this data and present benchmark results for machine translation and news topic classification tasks. Our evaluation examines the impact of different data types (originally clean text, post-corrected OCR, and back-translated data) and the effects of fine-tuning from pre-trained models, including those focused on African languages. Our benchmarks demonstrate good performance in news topic classification and promising results in machine translation. We fine-tuned multilingual encoder-decoder models using real and synthetic data and evaluated them on our test set and the FLORES evaluation sets. The results highlight the importance of incorporating more data and the potential for future improvements. All models, code, and datasets are available at https://huggingface.co/LIACC under the CC BY 4.0 license.