From Rules to Generalization: Efficient pre-training of language transformers on texts generated by a context-free grammar

MathAI 2025 Conference Submission 38

09 Feb 2025 (modified: 22 Feb 2025) · MathAI 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: pre-training; synthetic data; curriculum learning; deep learning; large language models; LLM
Abstract: This paper investigates the effectiveness of pre-training language transformers on synthetically generated text compared to natural language data, exploring implications for AI safety, model efficiency, and linguistic capability. We present two complementary studies. First, we compare transformer models pre-trained on natural language corpora against those trained on synthetic pseudo-language texts generated via context-free grammar rules. Fine-tuning experiments on the Russian SuperGLUE benchmark reveal statistically equivalent performance, suggesting that controlled synthetic datasets can provide comparable linguistic generalization, particularly in syntax and morphology, while enhancing safety through full control over data composition. Second, we propose a curriculum-based training schedule that integrates synthetic data, accelerating training without sacrificing accuracy on downstream tasks. Together, these findings highlight the potential of synthetic data as a resource-efficient and safer addition to the LLM pre-training pipeline.
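To illustrate the kind of data generation the abstract describes, the sketch below samples pseudo-language sentences from a toy context-free grammar. The grammar, nonterminal names, and pseudo-words here are hypothetical placeholders for illustration only; they are not the submission's actual production rules or vocabulary.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of alternative
# right-hand sides; symbols with no rule (e.g. "blork") are terminal
# pseudo-words. A real setup would use a much larger rule set.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "N":   [["blork"], ["fim"], ["quax"]],
    "V":   [["snarps"], ["glims"]],
    "DET": [["ta"], ["mo"]],
    "ADJ": [["zilky"], ["brundle"]],
}

def expand(symbol, rng, depth=0, max_depth=10):
    """Recursively expand a symbol into a list of terminal pseudo-words."""
    if symbol not in GRAMMAR or depth > max_depth:
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for sym in production:
        words.extend(expand(sym, rng, depth + 1, max_depth))
    return words

def generate_sentence(rng):
    """Sample one pseudo-language sentence from the start symbol S."""
    return " ".join(expand("S", rng))

if __name__ == "__main__":
    rng = random.Random(0)
    # Emit a few synthetic sentences that could feed an early curriculum stage.
    for _ in range(5):
        print(generate_sentence(rng))
```

In a curriculum-style schedule of the kind the abstract proposes, such synthetic sentences would presumably dominate the earliest training batches, with the mixture shifting toward natural-language text in later stages; the exact mixing proportions are not specified in this abstract.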
Submission Number: 38