Scaling Language Models on Machine-Translated Data: Effects of Source Text Complexity on Generalization to Native Text
Abstract: Pretraining on machine-translated (MT) text appears to be a viable alternative for pretraining language models in low-resource languages. Yet we still lack a clear picture of how well language models scale on such noisy corpora and which properties of the source text matter. We fill this gap with a controlled study in Indonesian and Tamil. Starting from one English corpus, we build two MT datasets (Natural-MT and a Simplified-MT variant generated with an LLM) and pretrain GPT-2 models of three sizes (124M, 355M, 774M). Our results show: (1) loss on held-out native text continues to fall with model size, indicating that extra capacity learns transferable patterns despite translation noise; (2) models trained on Natural-MT consistently outperform those trained on Simplified-MT, implying that the linguistic richness of the source text survives translation and aids generalization; (3) a brief continual-pretraining phase on a modest native corpus pushes performance beyond a native-only baseline; (4) when downstream task data are also machine-translated, MT-pretrained checkpoints match native-pretrained ones on sentiment analysis, NLI, and causal reasoning, though native exposure remains crucial for toxicity detection. Together, these findings suggest a practical recipe for data-poor languages: translate diverse English text, scale models, and devote any native data to a short adaptation phase.
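The adaptation step at the end of this recipe amounts to a short round of continual pretraining on native text. As a rough illustration only (not the authors' released code), the sketch below continues training an MT-pretrained GPT-2 checkpoint on a small native corpus using the Hugging Face transformers library; the checkpoint name, corpus file, and hyperparameters are hypothetical placeholders.

```python
# Minimal sketch of the adaptation phase: continue pretraining an MT-pretrained
# GPT-2 checkpoint on a small native-language corpus. Checkpoint and file names
# below are placeholders, not artifacts released with the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "gpt2-medium-natural-mt"    # placeholder: MT-pretrained checkpoint
native_corpus = "native_indonesian.txt"  # placeholder: small native text file

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Tokenize the native corpus for causal language modeling.
raw = load_dataset("text", data_files={"train": native_corpus})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="gpt2-medium-native-adapted",
    per_device_train_batch_size=8,
    num_train_epochs=1,       # a brief adaptation phase, not full pretraining
    learning_rate=5e-5,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```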
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, scaling, continual learning, transfer
Contribution Types: Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: English, Indonesian, Tamil
Submission Number: 7290