Keywords: multilingual machine translation, large language models
Abstract: Large language models have significantly advanced Multilingual Machine Translation (MMT), yet scaling to many languages while keeping quality robust across directions remains challenging.
In this paper, we identify a failure mode of multilingual supervised fine-tuning (SFT) on multi-way parallel data: when such data are reused symmetrically around a pivot language (e.g., English), performance on reverse directions (X $\to$ pivot) can drop substantially.
We term this phenomenon Directional Degeneration and attribute it to excessive many-to-one mappings, which encourage shortcut learning.
We propose \textbf{Strategic Downsampling}, a simple yet effective method to mitigate this degeneration.
In addition, we introduce \textbf{Parallel Multilingual Prompting (PMP)}, which augments translation instructions with an auxiliary parallel sentence to promote cross-lingual transfer during training and enables optional test-time enhancement when auxiliary translations are available.
We further develop \textbf{LMT}, a Chinese–English-centric suite of \textbf{L}arge-scale \textbf{M}ultilingual \textbf{T}ranslation models, available in four sizes (0.6B/1.7B/4B/8B) and covering 60 languages and 234 translation directions.
Comprehensive evaluations show that LMT is competitive among open-source MMT systems, and that our 4B LMT model performs on par with or better than substantially larger baselines.
We release our models to support inclusive and scalable MMT.
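The abstract names the two training-side ideas without giving their details, so the following is only a minimal sketch of how they might look in a data pipeline. Everything concrete here is an assumption rather than the authors' implementation: the language codes, the `keep_ratio` value, the function names `downsample_reverse` and `build_pmp_prompt`, and the prompt wording are all hypothetical.

```python
import random

# Assumption: pivot languages of a Chinese-English-centric setup, as language codes.
PIVOTS = {"en", "zh"}


def downsample_reverse(examples, keep_ratio=0.1, seed=0):
    """Keep every example except X -> pivot ones, of which only a fraction is kept.

    `keep_ratio` is a hypothetical value, not one taken from the paper.
    Pivot-to-pivot examples (en <-> zh) are not treated as reverse here.
    """
    rng = random.Random(seed)
    kept = []
    for ex in examples:
        is_reverse = ex["tgt_lang"] in PIVOTS and ex["src_lang"] not in PIVOTS
        if not is_reverse or rng.random() < keep_ratio:
            kept.append(ex)
    return kept


def build_pmp_prompt(src_lang, tgt_lang, src_text, aux_lang=None, aux_text=None):
    """Build a translation instruction, optionally with an auxiliary parallel sentence.

    The template wording is illustrative only.
    """
    lines = [
        f"Translate the following text from {src_lang} to {tgt_lang}.",
        f"{src_lang}: {src_text}",
    ]
    if aux_lang and aux_text:
        # Auxiliary parallel sentence in a third language, when one is available.
        lines.append(f"{aux_lang} (parallel reference): {aux_text}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


if __name__ == "__main__":
    data = [
        {"src_lang": "en", "tgt_lang": "de"},
        {"src_lang": "de", "tgt_lang": "en"},
    ]
    print(len(downsample_reverse(data, keep_ratio=0.0)))
    print(build_pmp_prompt("German", "English", "Guten Morgen.", "Chinese", "早上好。"))
```

Because the auxiliary sentence is optional in `build_pmp_prompt`, the same template serves both plain translation and the test-time enhancement case where an auxiliary translation happens to be available.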
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: multilingual MT, pre-training for MT, scaling
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Chinese, English, Japanese, Korean, Russian, German, French, Italian, Portuguese, Spanish, Uyghur, Tibetan, Inner Mongolian, Arabic, Bengali, Czech, Persian, Hebrew, Hindi, Indonesian, Khmer, Lao, Malay, Burmese, Dutch, Polish, Thai, Tagalog, Turkish, Urdu, Vietnamese, Cantonese, Tamil, Greek, Ukrainian, Swedish, Norwegian Bokmål, Danish, Kazakh, Croatian, Romanian, Hungarian, Slovak, Javanese, Bulgarian, North Azerbaijani, Nepali, Uzbek, Swahili, Pashto, Amharic, Icelandic, Finnish, Sinhala, Telugu, Marathi, Tajik, Kyrgyz, Georgian, Armenian
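The 60 languages listed above and the 234 directions stated in the abstract are consistent with one simple reading of "Chinese–English-centric": a direction counts if at least one side is Chinese or English. That reading is an assumption, not something the page states; the short check below uses placeholder names for the 58 other languages.

```python
from itertools import permutations

pivots = {"Chinese", "English"}
# Placeholder names stand in for the 58 non-pivot languages.
languages = ["Chinese", "English"] + [f"lang{i}" for i in range(58)]

# Ordered (source, target) pairs with at least one pivot side.
directions = [
    (src, tgt)
    for src, tgt in permutations(languages, 2)
    if src in pivots or tgt in pivots
]

print(len(languages))   # 60
print(len(directions))  # 234 = 2*59 (English-centric) + 2*58 (remaining Chinese-centric)
```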
Submission Number: 9462