Extending MTOB (Machine Translation from One Book) for Extremely Low-Resource Indonesian Languages

ACL ARR 2026 May Submission 16305 Authors

26 May 2026 (modified: 11 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: machine translation, low-resource languages, Tolaki, Talaud, Indonesian languages, retrieval-augmented prompting, morphological analyzer, Chain-of-Thought, in-context learning
Abstract: Machine translation for low-resource languages remains challenging due to the lack of parallel corpora. This study extends the Machine Translation from One Book (MTOB) work (Tanzer et al., 2023) to translation from two extremely low-resource languages of Indonesia, Tolaki and Talaud, into Indonesian. To accommodate their complex agglutinative morphology, we combine retrieval-augmented prompting, a rule-based morphological analyzer, and Chain-of-Thought translation prompting. Experiments with Qwen3 show substantial gains over zero-shot baselines. For Tolaki, BLEU improves from 0.9 to 22.8 and chrF from 20.9 to 53.8; for Talaud, BLEU rises from 0.7 to 5.1 and chrF from 19.5 to 37.0. The results indicate that structured linguistic documentation can meaningfully improve translation of languages the model has not seen. However, error analysis shows that translation quality remains heavily constrained by the out-of-vocabulary (OOV) rate: OOV words trigger cascading hallucinations and forced fluency, which underscores the need for broad lexical coverage.
Paper Type: Short
Research Area: Machine Translation
Research Area Keywords: few-shot/zero-shot MT, multilingual MT
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Tolaki, Talaud, Indonesian
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 16305