Keywords: LLM, Multilinguality
Abstract: Multilingual LLMs are emerging rapidly, accompanied by claims of supporting an ever-increasing number of languages. However, significant gaps remain between their performance in English and in other languages. Because low-resource language data is limited in both quantity and quality, improving performance in these languages independently is difficult. A natural alternative is to transfer the capabilities learned in English to low-resource languages. Parallel corpora play a key role in such transfer, and several prior works have studied their use empirically. Yet, which types of parallel corpora contribute most effectively to cross-lingual transfer has not been systematically explored.
To address this, we propose ParaRater, a corpus selection method designed to identify the most valuable English data to be translated into target languages, thereby constructing high-quality parallel corpora that efficiently boost performance in those languages. ParaRater leverages meta-learning to align corpus selection directly with model performance on native target-language data. It further employs a two-stage filtering process to pinpoint data that is effective only when both language versions appear in training, i.e., truly impactful parallel corpora.
We demonstrate the effectiveness of this approach across multiple languages and provide detailed qualitative analyses, offering new insights into cross-lingual transfer in large language models. Our rater, datasets, and code are all released as open source.
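To make the two-stage filtering idea from the abstract concrete, here is a minimal illustrative sketch. The data structure, the rater functions (`rate_pair`, `rate_english_only`), and the thresholds are hypothetical stand-ins for exposition only, not the authors' released implementation; in ParaRater the scores would come from a meta-learned rater aligned with performance on native target-language data.

```python
# Illustrative sketch only: rater functions and thresholds are hypothetical
# placeholders, not the authors' released ParaRater implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    english_text: str
    translated_text: str  # target-language translation of the English example


def two_stage_filter(
    candidates: List[Candidate],
    rate_pair: Callable[[Candidate], float],          # predicted gain when BOTH versions are in training
    rate_english_only: Callable[[Candidate], float],  # predicted gain when only the English side is in training
    pair_threshold: float = 0.5,
    margin: float = 0.1,
) -> List[Candidate]:
    """Keep candidates predicted to help target-language performance
    specifically because both language versions appear in training."""
    # Stage 1: keep candidates whose parallel (English + target) version is
    # predicted to improve target-language performance at all.
    stage1 = [c for c in candidates if rate_pair(c) >= pair_threshold]

    # Stage 2: among those, keep only candidates whose predicted gain with the
    # parallel pair clearly exceeds the gain from the English side alone.
    return [c for c in stage1 if rate_pair(c) - rate_english_only(c) >= margin]
```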
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 8721