XLLaMA2: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

ACL ARR 2024 June Submission 4457 Authors

16 Jun 2024 (modified: 09 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To mitigate this, we continue pre-training LLaMA2-7B to support translation across more than 100 languages. Following a thorough analysis of training strategies, including vocabulary expansion and data augmentation, we apply extensive multilingual continued pre-training to the LLaMA series models, resulting in XLLaMA2. Without sacrificing general capabilities, XLLaMA2 significantly surpasses existing LLMs in translation performance and is on par with a specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Specifically, XLLaMA2 achieves an average spBLEU improvement of over 10 points compared to the original LLaMA2 model. When further evaluated on Flores-200, XLLaMA2 exhibits notable performance gains even for languages not included in the training set. We will make the code and model publicly available.
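For readers curious about what the vocabulary-expansion step mentioned in the abstract typically involves, the sketch below shows one common way to add new multilingual tokens to a LLaMA2 tokenizer and resize the embedding matrix before continued pre-training, using the Hugging Face transformers API. This is not the authors' released code; the model checkpoint name and the token list are illustrative assumptions only.

```python
# Minimal sketch of vocabulary expansion before continued pre-training.
# NOTE: checkpoint name and token list are placeholder assumptions,
# not taken from the submission.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical subword tokens collected from low-resource-language corpora.
new_tokens = ["▁ደህና", "▁ສະບາຍ", "▁ଶୁଭ"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the new rows exist and can be trained
# during the subsequent multilingual continued pre-training.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```

After resizing, the newly added embedding rows are freshly initialized and only acquire useful representations during the continued pre-training itself, which is why the paper weighs vocabulary expansion against simply reusing the original tokenizer.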
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Translation, Large Language Model, Continued Pre-training
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: 102 languages covered by Flores-101
Submission Number: 4457