ChakmaNMT: Low-resource Machine Translation On Endangered Chakma Language

ACL ARR 2025 May Submission 7981 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: The Chakma language, spoken by the indigenous Chakma community living mostly in Bangladesh and India, is poorly digitized and is at risk of extinction owing to its limited linguistic resources. A machine translation (MT) model between Chakma and Bangla could play a crucial role in preserving the language and in bridging the cultural and linguistic gap between Chakma and Bangla speakers. In this paper, we study MT in both directions, Chakma to Bangla (CCP-BN) and Bangla to Chakma (BN-CCP), and introduce a novel dataset of 15,021 parallel samples and 42,783 monolingual Chakma samples. We also present a high-quality benchmark dataset of 600 parallel samples across Chakma, Bangla, and English. Additionally, we developed a transliteration system that converts the extremely low-resource Chakma script into Bangla so that existing Bangla pre-trained models can be leveraged. We experimented with both traditional and state-of-the-art approaches, including statistical machine translation, neural machine translation, and in-context learning with large language models (LLMs). In our experiments, fine-tuning BanglaT5 with back-translation on transliterated Chakma achieved the highest BLEU scores: 17.8 (CCP-BN) and 4.41 (BN-CCP). We also found that, with our transliteration system, commercial LLMs achieve near state-of-the-art performance using significantly fewer examples.
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Low-resource Language
Contribution Types: Approaches to low-resource settings, Data analysis
Languages Studied: Chakma, Bangla
Submission Number: 7981