Abstract: Low-Resource Languages (LRLs) present significant challenges in natural language processing due to their limited linguistic resources and underrepresentation in standard datasets. While recent advances in Large Language Models (LLMs) and Neural Machine Translation have substantially improved translation capabilities for high-resource languages, performance disparities persist for LRLs, particularly in privacy-sensitive and resource-constrained scenarios. This paper systematically evaluates current LLMs across 200 languages using the FLORES-200 benchmark and demonstrates their limitations in translating LRLs. We also explore alternative data sources, including news articles and bilingual dictionaries, and show how knowledge distillation from large pre-trained teacher models can significantly improve the performance of small LLMs on LRL translation tasks. For example, this approach raises the EN$\Rightarrow$LB LLM-as-a-Judge score on the validation set from 0.36 to 0.89 for Llama-3.2-3B. Furthermore, we examine different fine-tuning configurations, providing practical insights into optimal data scale, training efficiency, and the preservation of the models' generalization capabilities.
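The sketch below illustrates, under stated assumptions, the kind of sequence-level distillation the abstract describes: a large pre-trained teacher produces EN$\Rightarrow$LB translations, and the synthetic pairs become supervised fine-tuning data for a small student LLM such as Llama-3.2-3B. The teacher model, prompt template, and helper names are illustrative choices, not the paper's exact configuration.

```python
# A minimal sketch of teacher-generated distillation data for EN->LB translation.
# Assumptions: NLLB-200 stands in for the (unspecified) teacher; the prompt
# template and function names are hypothetical.
from transformers import pipeline

TEACHER_ID = "facebook/nllb-200-distilled-600M"  # assumed teacher; the paper's may differ
STUDENT_PROMPT = "Translate English to Luxembourgish.\nEnglish: {src}\nLuxembourgish: {tgt}"


def build_distillation_pairs(english_sentences):
    """Translate English sentences with the teacher to create synthetic EN->LB pairs."""
    translator = pipeline(
        "translation",
        model=TEACHER_ID,
        src_lang="eng_Latn",   # FLORES-200 language codes
        tgt_lang="ltz_Latn",   # Luxembourgish
    )
    outputs = translator(english_sentences, max_length=256)
    return [(src, out["translation_text"]) for src, out in zip(english_sentences, outputs)]


def to_sft_examples(pairs):
    """Format teacher-labelled pairs as prompt+target strings for causal-LM
    fine-tuning of a small student such as Llama-3.2-3B."""
    return [STUDENT_PROMPT.format(src=src, tgt=tgt) for src, tgt in pairs]


if __name__ == "__main__":
    pairs = build_distillation_pairs(["The weather is nice today."])
    print(to_sft_examples(pairs)[0])
```

The same synthetic pairs could equally feed a logit-matching distillation objective; the abstract does not specify which variant is used, so the sequence-level formulation above is only one plausible reading.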
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: distillation; data augmentation; LLM efficiency; NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches to low-compute settings / efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Luxembourgish
Submission Number: 3361