Abstract: Neural Machine Translation (NMT) typically requires a large parallel corpus to achieve good performance, which is often unavailable for minority languages. Current methods usually pre-train seq2seq models on monolingual data in a denoising manner and then fine-tune on parallel data to improve low-resource translation. However, minority languages spoken in adjacent areas may correlate with each other, and modeling them jointly may lead to better performance. In this paper, we propose to improve Chinese minority language translation with Multilingual NMT (MNMT). Since the tokens of the minority languages are covered by neither Chinese BART nor mBART, and the vocabulary size of the multilingual data exceeds that of the pre-trained model, we address these two issues respectively by mapping the minority-language vocabulary onto the pre-trained BART vocabulary according to token frequency, and by enlarging the BART vocabulary with repeated low-frequency tokens. Our experimental results on the CCMT 2023 Chinese minority language translation tasks show that joint modeling improves the Uyghur-to-Chinese and Tibetan-to-Chinese tasks by +2.85 and +1.30 BLEU respectively with BART base, and yields BLEU scores of 55.48, 53.52, and 48.26 on the Mongolian-to-Chinese, Tibetan-to-Chinese, and Uyghur-to-Chinese translation tasks respectively with BART large.
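To make the vocabulary handling concrete, below is a minimal sketch of a frequency-based mapping, assuming the paper's general idea rather than its exact procedure: minority-language tokens are ranked by corpus frequency and paired with BART vocabulary entries of the same rank, while tokens that overflow the original vocabulary become newly appended entries (the enlarged part). The function name `map_minority_vocab` and all data in the usage example are illustrative, not from the paper.

```python
from collections import Counter

def map_minority_vocab(minority_counter: Counter, bart_tokens_by_freq: list):
    """Assign each minority-language token to a BART vocabulary entry of the
    same frequency rank; tokens beyond the BART vocabulary size become new
    entries appended at the end (the enlarged portion of the vocabulary)."""
    mapping = {}          # minority token -> BART token (or new entry)
    new_entries = []      # tokens that enlarge the BART vocabulary
    for rank, (tok, _) in enumerate(minority_counter.most_common()):
        if rank < len(bart_tokens_by_freq):
            mapping[tok] = bart_tokens_by_freq[rank]   # reuse an existing slot
        else:
            new_entries.append(tok)                    # overflow: append a new slot
            mapping[tok] = tok
    return mapping, new_entries

# Toy usage with made-up tokens and a tiny "BART" vocabulary
minority_counter = Counter({"ᠮᠣᠩᠭᠣᠯ": 50, "ᠦᠰᠦᠭ": 30, "ᠪᠢᠴᠢᠭ": 10, "ᠨᠣᠮ": 2})
bart_tokens_by_freq = ["的", "是", "在"]  # ranked by corpus frequency
mapping, new_entries = map_minority_vocab(minority_counter, bart_tokens_by_freq)
print(mapping)       # {'ᠮᠣᠩᠭᠣᠯ': '的', 'ᠦᠰᠦᠭ': '是', 'ᠪᠢᠴᠢᠭ': '在', 'ᠨᠣᠮ': 'ᠨᠣᠮ'}
print(new_entries)   # ['ᠨᠣᠮ']
```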