Keywords: machine translation, low-resource NLP, transliteration, multilingual models, benchmarking, evaluation
Abstract: We present the first systematic study of machine translation for Chakma, an endangered and extremely low-resource Indo-Aryan language, with the goal of supporting language access and preservation.
We introduce a new Chakma--Bangla parallel and monolingual dataset, along with a trilingual Chakma--Bangla--English benchmark for evaluation.
To address script mismatch and data scarcity, we propose a character-level transliteration framework that exploits the close orthographic and phonological relationship between Chakma and Bangla, preserving semantic content while enabling effective transfer from Bangla and multilingual pretrained models.
We benchmark from-scratch MT, fine-tuned pretrained models, and large language models via in-context learning.
Results show that transliteration is essential and that fine-tuning and in-context learning substantially outperform from-scratch baselines, with strong asymmetry across translation directions.
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: automatic evaluation, few-shot/zero-shot MT, multilingual MT, resources for less-resourced languages, language documentation, datasets and benchmarking, transliteration
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Chakma, Bangla (Bengali), English
Submission Number: 9857