Scaling Law of Knowledge Exposure for Continual Pre-training of Large Language Models

ACL ARR 2025 May Submission 3291 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · License: CC BY 4.0
Abstract: While general-purpose large language models (LLMs) demonstrate broad capabilities, effective domain knowledge adaptation requires specialized training through continual pre-training (CPT). A key factor in knowledge injection during CPT is $\textit{exposure times}$—how often a model encounters specific knowledge. This paper presents the first systematic study of the scaling relationship between exposure and injection effectiveness. Using synthesized fictitious and real-world datasets, we train models from $0.5$B to $7$B parameters. Results show that injection follows a log-sigmoid trajectory across exposures, with consistent learning phases regardless of model size or knowledge type. We find that required exposure scales with model size following a power law, enabling predictions from small-scale experiments. Notably, relation type—not prior knowledge—primarily determines saturation. We also propose a data synthesis pipeline for more realistic, controllable training setups. These findings reveal predictable scaling behaviors in CPT, offering implications for developing domain-specific language models efficiently.
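To make the two fitted forms in the abstract concrete, below is a minimal sketch, assuming injection accuracy is modeled as a logistic function of log-exposure and the exposure count required to reach a target accuracy follows a power law in model size. The parameterizations, the use of scipy.optimize.curve_fit, and every number here are illustrative placeholders, not values or code from the paper.

```python
# Hypothetical sketch (not from the paper): fit a log-sigmoid curve of injection
# accuracy vs. exposure count, then a power law for the exposure count needed to
# reach a target accuracy as a function of model size. All numbers are toy data.
import numpy as np
from scipy.optimize import curve_fit

def log_sigmoid(e, a, b, c):
    """Accuracy as a logistic function of log-exposure: a / (1 + exp(-b * (ln e - c)))."""
    return a / (1.0 + np.exp(-b * (np.log(e) - c)))

def power_law(n, k, alpha):
    """Required exposures E(N) = k * N^alpha, with model size N in billions of parameters."""
    return k * np.power(n, alpha)

# Toy measurements for one model size: exposure counts and injection accuracy.
exposures = np.array([1, 2, 5, 10, 20, 50, 100, 200], dtype=float)
accuracy = np.array([0.02, 0.05, 0.15, 0.35, 0.60, 0.82, 0.90, 0.92])
(a, b, c), _ = curve_fit(log_sigmoid, exposures, accuracy, p0=[1.0, 1.0, 3.0])
print(f"log-sigmoid fit: a={a:.2f}, b={b:.2f}, c={c:.2f}")

# Toy measurements across model sizes (in billions of parameters): exposures
# needed to reach a fixed target accuracy.
sizes_b = np.array([0.5, 1.5, 3.0, 7.0])
required = np.array([180.0, 120.0, 90.0, 60.0])
(k, alpha), _ = curve_fit(power_law, sizes_b, required, p0=[150.0, -0.4])
print(f"power-law fit: k={k:.1f}, alpha={alpha:.3f}")

# Extrapolate from the small-scale fit to a larger (hypothetical) model size.
print(f"predicted exposures for a 70B model: {power_law(70.0, k, alpha):.1f}")
```

The sketch only illustrates the general workflow the abstract implies: fit the exposure-accuracy curve per model size, extract the exposure count at a target accuracy, then fit and extrapolate the size-exposure power law.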
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Model, Scaling Law, Continual Pre-training
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3291