From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

Published: 24 Sept 2025, Last Modified: 09 Oct 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: continual pretraining, model growth, scaling law
TL;DR: We show and quantify how an overtrained base model degrades performance in multi-stage pretraining
Abstract: Bootstrapped pretraining, i.e., reusing a pretrained base model for continual pretraining or model growth, is a promising strategy to reduce the cost of training large language models (LLMs) from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we study the scaling behavior of bootstrapped pretraining and find that scaling efficiency diminishes as the amount of prior training increases: the scaling exponent with respect to second-stage training tokens decays logarithmically with the base model’s pretraining tokens. This saturation in scaling behavior highlights a fundamental trade-off in multi-stage pretraining strategies: the more heavily a model is pretrained, the less benefit additional tokens provide during bootstrapping. Our findings provide practical insights for efficient LLM training and raise important considerations for the reuse of overtrained models.
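A minimal sketch of the functional form the abstract describes, assuming a standard Chinchilla-style power law in second-stage tokens; the symbols L, E, A, beta_0, k, D_1, and D_2 are illustrative placeholders and are not taken from the paper:

% Hedged sketch (not the paper's fitted law): second-stage loss modeled as a
% power law in second-stage tokens D_2, whose exponent decays logarithmically
% in the base model's pretraining tokens D_1.
\[
  L(D_2 \mid D_1) \;\approx\; E + \frac{A}{D_2^{\,\beta(D_1)}},
  \qquad
  \beta(D_1) \;\approx\; \beta_0 - k \log D_1 .
\]
% A larger D_1 (a more heavily pretrained base model) gives a smaller exponent
% \beta(D_1), i.e. diminishing returns from additional bootstrapping tokens.

Under this assumed form, the "saturation" in the abstract corresponds to beta(D_1) shrinking toward zero as the base model's pretraining budget D_1 grows.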
Submission Number: 60