Keywords: cross-lingual pretraining, data scaling, low-resource languages
TL;DR: Scaling low-resource language performance with high-resource language data
Abstract: Large language models (LLMs) achieve consistent performance gains through data scaling, yet low-resource languages remain limited by small and stagnant dataset sizes.
To address this limitation, we introduce cross-lingual data scaling, where performance in low-resource languages scales with the dataset size of high-resource languages.
We systematically investigate two potential approaches: (i) transforming high-resource language data into synthetic data for low-resource languages via translation or code-switching, and (ii) transferring the learned knowledge from high-resource languages to low-resource languages by adjusting language order and proportion during pretraining.
Experiments on English and Chinese show that data transformation fails to sustain cross-lingual data scaling, whereas knowledge transfer enables low-resource language performance to scale with the growth of high-resource language data.
Building on these findings, we propose ScaleX, a two-stage pretraining framework designed for effective cross-lingual data scaling.
In the first stage, LLMs are pretrained on high-resource language data under a constant learning rate schedule;
in the second stage, training continues on a mixture of high- and low-resource languages under a cosine learning rate schedule.
ScaleX outperforms existing approaches with progressively larger margins as high-resource data scales up, and further generalizes to both multilingual and large-scale bilingual pretraining.
Our analysis also reveals that learning rate scheduling and shared tokens across languages are critical to sustaining performance scaling in low-resource languages.
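As a rough illustration of the two-stage schedule the abstract describes (a constant learning rate on high-resource data, followed by a cosine-decayed learning rate on a bilingual mixture), here is a minimal Python sketch. All step counts, learning-rate values, and the mixing ratio below are placeholder assumptions for illustration, not the paper's actual hyperparameters.

```python
import math

# Minimal sketch of a ScaleX-style two-stage schedule.
# Stage 1: constant LR on high-resource data only.
# Stage 2: cosine LR decay on a high-/low-resource mixture.
# All numbers here are assumed placeholders, not the paper's settings.

def scalex_lr(step: int,
              stage1_steps: int = 100_000,   # assumed length of stage 1
              stage2_steps: int = 20_000,    # assumed length of stage 2
              peak_lr: float = 3e-4,         # assumed constant LR for stage 1
              min_lr: float = 3e-5) -> float:
    """Return the learning rate: constant in stage 1, cosine decay in stage 2."""
    if step < stage1_steps:
        return peak_lr
    progress = min(1.0, (step - stage1_steps) / stage2_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def data_mixture(step: int, stage1_steps: int = 100_000) -> dict:
    """Return sampling weights: high-resource only in stage 1, a mixture in stage 2."""
    if step < stage1_steps:
        return {"high_resource": 1.0, "low_resource": 0.0}
    return {"high_resource": 0.7, "low_resource": 0.3}  # illustrative ratio

if __name__ == "__main__":
    for s in (0, 99_999, 100_000, 110_000, 120_000):
        print(s, round(scalex_lr(s), 6), data_mixture(s))
```

In this sketch the stage boundary simultaneously switches the learning-rate schedule and the data mixture, which is the structural point of the two-stage design; the specific ratios and decay horizon would be tuned in practice.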
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 24983