Keywords: machine translation, low-resource NLP, transliteration, multilingual models, benchmarking, evaluation
Abstract: We present the first systematic study of machine translation for Chakma, an endangered and extremely low-resource Indo-Aryan language, with the goal of supporting language access and preservation.
We introduce a new Chakma--Bangla parallel and monolingual dataset, along with a trilingual Chakma--Bangla--English benchmark for evaluation.
To address script mismatch and data scarcity, we propose a character-level transliteration framework that exploits the close orthographic and phonological relationship between Chakma and Bangla, preserving semantic content while enabling effective transfer from Bangla and multilingual pretrained models.
We benchmark from-scratch MT, fine-tuned pretrained models, and large language models via in-context learning.
Results show that transliteration is essential and that fine-tuning and in-context learning substantially outperform from-scratch baselines, with strong asymmetry across translation directions.
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: automatic evaluation, few-shot/zero-shot MT, multilingual MT, resources for less-resourced languages, language documentation, datasets and benchmarking, transliteration
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Chakma, Bangla (Bengali), English
Submission Number: 9857