From Translation to Multilinguality: Revisiting the Role of Parallel Data in Multilingual LLM Pretraining
Keywords: multilingual large language models, parallel data, pretraining
Abstract: Multilingual large language models (MLLMs) are commonly trained with parallel data (i.e., concatenated translation pairs) to introduce cross-lingual alignment signals and induce capability transfer to non-English languages. However, it remains unclear whether this de facto practice improves general multilingual ability beyond translation. We conduct a controlled, large-scale study comparing two ways of using parallel data in pretraining: (1) the standard approach of concatenating each translation pair into a single sample, and (2) treating each side of a pair as an independent sample. Across diverse experimental settings, we find consistent results: parallel concatenation yields substantial gains on translation metrics but offers limited benefits for general monolingual and cross-lingual abilities. This suggests that while parallel-form alignment signals directly build translation ability, they do not readily transfer into broader multilingual competence through the standard learning process. Motivated by this gap, we propose a pragmatic multi-step pipeline that leverages the translation ability induced by parallel data from a data-driven perspective and consistently improves general monolingual and cross-lingual performance. Our findings clarify the role and limits of parallel data in MLLM pretraining and offer a practical recipe for building more comprehensively capable multilingual models.
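The abstract contrasts two ways of formatting the same parallel corpus for pretraining. Below is a minimal, illustrative sketch of that contrast; it is not the paper's code, and the separator token and field names are assumptions made for illustration only.

```python
# Illustrative sketch (not the paper's implementation) of the two data
# formats compared in the study: concatenated translation pairs vs.
# each side used as an independent monolingual sample.

from typing import Dict, List

SEP = " </s> "  # hypothetical separator between the two sides of a pair


def concatenated_samples(pairs: List[Dict[str, str]]) -> List[str]:
    """Format (1): each translation pair becomes one training sample,
    so the model sees explicitly aligned bilingual text."""
    return [p["src"] + SEP + p["tgt"] for p in pairs]


def independent_samples(pairs: List[Dict[str, str]]) -> List[str]:
    """Format (2): each side of a pair is a separate sample,
    removing the explicit cross-lingual alignment signal."""
    samples: List[str] = []
    for p in pairs:
        samples.append(p["src"])
        samples.append(p["tgt"])
    return samples


if __name__ == "__main__":
    pairs = [{"src": "The cat sleeps.", "tgt": "Le chat dort."}]
    print(concatenated_samples(pairs))  # ['The cat sleeps. </s> Le chat dort.']
    print(independent_samples(pairs))   # ['The cat sleeps.', 'Le chat dort.']
```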
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11046