Less Is More: Distilling Large-Scale Data with LLMs for Chinese-Centric Low-Resource Multilingual Machine Translation

ACL ARR 2025 May Submission864 Authors

15 May 2025 (modified: 03 Jul 2025) · License: CC BY 4.0
Abstract: Neural Machine Translation (NMT) between Chinese and low-resource languages (LRLs) faces significant challenges due to limited, noisy training data. We introduce MERIT, a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian LRLs. Our approach integrates Language-specific Token Prefixing (LTP) for effective language conditioning and supervised fine-tuning (SFT). A key innovation, Group Relative Policy Optimization (GRPO) guided by the Score Accuracy Reward (SAR) function, strategically filters training data and optimizes model performance. Experiments with models up to 3 billion parameters (MERIT-3B) confirm the efficacy of our method. Ablation studies demonstrate substantial improvements from SFT-LTP over zero-shot baselines, while GRPO-SAR achieves further significant gains using only 22.8% of the original data, increasing BLEU-chrF scores by 17.4%. MERIT-3B notably surpasses open-source models such as NLLB-200 3.3B by 9.5 BLEU-4 points on Chinese–Indonesian translation and outperforms M2M-100 by 5.1 BLEU-4 points on Chinese–Lao. These findings highlight the pivotal role of targeted data curation and reward-guided training over mere model scaling, advancing multilingual translation in low-resource settings. Code and data are available at https://anonymous.4open.science/r/MERIT-864/.
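For readers unfamiliar with language-tag conditioning, the sketch below illustrates the general idea behind Language-specific Token Prefixing as summarized in the abstract: prepending source- and target-language tokens to each example so a single model can be conditioned on the translation direction. The tag strings, function name, and prompt layout here are illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of language-specific token prefixing for direction conditioning.
# LANG_TAGS and the prompt layout are hypothetical; the paper's actual tokens may differ.
from typing import Optional

LANG_TAGS = {
    "zh": "<zho>",  # Chinese
    "vi": "<vie>",  # Vietnamese
    "my": "<mya>",  # Burmese
    "th": "<tha>",  # Thai
    "lo": "<lao>",  # Lao
    "tl": "<tgl>",  # Tagalog
}

def build_prefixed_example(src_text: str, src_lang: str, tgt_lang: str,
                           tgt_text: Optional[str] = None) -> dict:
    """Prepend source and target language tokens to the source sentence.

    The target text is attached only when building supervised fine-tuning pairs;
    at inference time it is left as None.
    """
    prompt = f"{LANG_TAGS[src_lang]} {LANG_TAGS[tgt_lang]} {src_text}"
    return {"prompt": prompt, "completion": tgt_text}

# Usage: a Chinese -> Vietnamese training pair for SFT.
example = build_prefixed_example("你好，世界。", "zh", "vi", "Xin chào thế giới.")
print(example["prompt"])   # "<zho> <vie> 你好，世界。"
```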
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: few-shot/zero-shot MT, multilingual MT, data-efficient training, LLM, fine-tuning, datasets for low resource languages
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings-efficiency, Data resources
Languages Studied: Chinese, English, Vietnamese, Burmese, Thai, Lao, Tagalog
Keywords: few-shot/zero-shot MT, multilingual MT, data-efficient training, LLM, fine-tuning, datasets for low resource languages
Submission Number: 864