Boosting Translation Capabilities of Large Language Models with Code-Switching Pretraining

ACL ARR 2024 June Submission 4190 Authors

16 Jun 2024 (modified: 04 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Adapting the translation capabilities of large language models has recently attracted significant attention. A representative approach, ALMA, uses a two-stage training recipe: pretraining on a large amount of monolingual data to improve proficiency in non-English languages, followed by fine-tuning on a small amount of high-quality bilingual data. However, the pretraining stage provides no explicit cross-lingual alignment signal, and excessive use of bilingual data can cause catastrophic forgetting; both issues limit further improvement of the model's translation ability. In this paper, we address these issues by introducing a new pretraining stage based on code-switching data. This stage supplies rich cross-lingual alignment information while keeping the training data semantically coherent documents, which helps mitigate catastrophic forgetting. Moreover, constructing the data requires only monolingual corpora and a pair of conventional machine translation models, making the approach highly versatile. Experimental results show that our method improves translation quality and achieves state-of-the-art results among comparable approaches.
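To make the data-construction idea in the abstract concrete, below is a minimal illustrative sketch of how code-switched pretraining documents could be built from monolingual text plus a machine translation system. The function name `make_code_switched_doc`, the sentence-level switching granularity, the `switch_prob` parameter, and the dummy translator are all assumptions for illustration; the paper's actual recipe may switch at a different granularity and uses its own MT models.

```python
import random
from typing import Callable, List


def make_code_switched_doc(
    sentences: List[str],
    translate_fn: Callable[[str], str],
    switch_prob: float = 0.3,
    seed: int = 0,
) -> List[str]:
    """Replace a random subset of sentences in a monolingual document with
    their machine translations, producing a code-switched yet still
    semantically coherent document for continued pretraining.

    This is only a sketch of the general idea, not the paper's exact recipe.
    """
    rng = random.Random(seed)
    return [
        translate_fn(s) if rng.random() < switch_prob else s
        for s in sentences
    ]


if __name__ == "__main__":
    # Dummy translator standing in for a real en->de MT model
    # (assumption: any str -> str callable works here).
    dummy_de = lambda s: f"[DE] {s}"
    doc = [
        "Large language models translate well after adaptation.",
        "Monolingual pretraining improves non-English fluency.",
        "Bilingual fine-tuning aligns the two languages.",
    ]
    print(make_code_switched_doc(doc, dummy_de, switch_prob=0.5))
```

Because the switched segments remain embedded in otherwise natural documents, the resulting corpus carries cross-lingual alignment signal without departing from the document-level pretraining distribution, which is the property the abstract credits with mitigating catastrophic forgetting.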
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Machine Translation, Large Language Model, Code-Switching
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: Chinese, English, German
Submission Number: 4190