Keywords: cross-lingual pretraining, data scaling, low-resource languages
TL;DR: Scaling low-resource language performance with high-resource language data
Abstract: Large language models (LLMs) achieve consistent performance gains through data scaling, yet low-resource languages remain limited by small and stagnant dataset sizes.
To address this limitation, we introduce cross-lingual data scaling, where performance in low-resource languages scales with the dataset size of high-resource languages.
We systematically investigate two potential approaches: (i) transforming high-resource language data into synthetic data for low-resource languages via translation or code-switching, and (ii) transferring the learned knowledge from high-resource languages to low-resource languages by adjusting language order and proportion during pretraining.
Experiments on English and Chinese show that data transformation fails to sustain cross-lingual data scaling, whereas knowledge transfer enables low-resource language performance to scale with the growth of high-resource language data.
Building on these findings, we propose ScaleX, a two-stage pretraining framework designed for effective cross-lingual data scaling.
In the first stage, LLMs are pretrained on high-resource language data under a constant learning rate schedule;
in the second stage, training continues on a mixture of high- and low-resource languages under a cosine learning rate schedule.
ScaleX outperforms existing approaches with progressively larger margins as high-resource data scales up, and further generalizes to both multilingual and large-scale bilingual pretraining.
Our analysis also reveals that learning rate scheduling and shared tokens across languages are critical to sustaining performance scaling in low-resource languages.
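As a rough illustration of the two-stage schedule the abstract describes (a constant learning rate on high-resource data, followed by a cosine-decayed learning rate on a bilingual mixture), here is a minimal Python sketch. All step counts, learning-rate values, and the mixing ratio below are placeholder assumptions for illustration, not the paper's actual hyperparameters.

```python
import math

# Minimal sketch of a ScaleX-style two-stage schedule.
# Stage 1: constant LR on high-resource data only.
# Stage 2: cosine LR decay on a high-/low-resource mixture.
# All numbers here are assumed placeholders, not the paper's settings.

def scalex_lr(step: int,
              stage1_steps: int = 100_000,   # assumed length of stage 1
              stage2_steps: int = 20_000,    # assumed length of stage 2
              peak_lr: float = 3e-4,         # assumed constant LR for stage 1
              min_lr: float = 3e-5) -> float:
    """Return the learning rate: constant in stage 1, cosine decay in stage 2."""
    if step < stage1_steps:
        return peak_lr
    progress = min(1.0, (step - stage1_steps) / stage2_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def data_mixture(step: int, stage1_steps: int = 100_000) -> dict:
    """Return sampling weights: high-resource only in stage 1, a mixture in stage 2."""
    if step < stage1_steps:
        return {"high_resource": 1.0, "low_resource": 0.0}
    return {"high_resource": 0.7, "low_resource": 0.3}  # illustrative ratio

if __name__ == "__main__":
    for s in (0, 99_999, 100_000, 110_000, 120_000):
        print(s, round(scalex_lr(s), 6), data_mixture(s))
```

In this sketch the stage boundary simultaneously switches the learning-rate schedule and the data mixture, which is the structural point of the two-stage design; the specific ratios and decay horizon would be tuned in practice.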
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 24983