Keywords: Model growth, Optimal growth schedule, Efficient LLM pre-training
TL;DR: A Schedule Learning methodology via an Optimal Path (SLOP) for multi-stage model growth with minimal experimental training
Abstract: Existing training methods for Transformer-based large language models (LLMs) rely on training from scratch on massive amounts of data, which incurs high computational and time costs. One promising research direction develops effective lifelong learning pipelines for efficient LLM pre-training by growing small pre-trained models into large ones, a technique known as model growth. Model growth involves two main research problems: the growth schedule and the growth operators. Existing research focuses on growth operators, i.e., the specific manipulations of potential dimensions that expand Transformer parameters. Few studies have investigated the optimal growth schedule, which integrates all possible growth operators into an optimal multi-stage growth path. This work presents a theoretical study of the optimal growth schedule for multi-stage LLM growth by introducing a Schedule Learning methodology that uses an Optimal Path requiring minimal experimental training, referred to as SLOP. SLOP uses marginal utility as a measure of an optimal schedule, balancing training cost against model performance after multi-stage growth. With this measure, determining the optimal growth schedule is converted into a dynamic programming problem that can be solved in polynomial time. Experiments with target LLMs of up to 7B parameters demonstrate SLOP's theoretical validity and efficiency, outperforming alternative schedules in a range of settings.
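The abstract does not spell out the dynamic program, but the polynomial-time schedule search it describes can be illustrated as a shortest-path computation over a DAG of candidate model sizes. The sketch below is a minimal illustration, not the authors' implementation: the function names, the candidate size grid, and the stand-in cost function (a crude proxy for SLOP's marginal-utility measure) are all assumptions.

```python
# Illustrative sketch (not the paper's code): the multi-stage growth schedule
# cast as a DAG shortest-path / dynamic-programming problem over candidate
# model sizes. The cost model and size grid below are hypothetical.

def optimal_growth_schedule(sizes, edge_cost):
    """Return (min total cost, growth path) from sizes[0] to sizes[-1].

    sizes     : candidate model sizes in increasing order; sizes[0] is the
                initial pre-trained model, sizes[-1] the target LLM.
    edge_cost : edge_cost(i, j) -> float for growing sizes[i] -> sizes[j]
                (j > i); in SLOP's terms this would combine the stage's
                training cost with its marginal utility.
    """
    n = len(sizes)
    best = [float("inf")] * n   # best[j]: min total cost to reach sizes[j]
    prev = [None] * n           # back-pointers to reconstruct the schedule
    best[0] = 0.0
    # Nodes are size-ordered, so the graph is a DAG: O(n^2) edges, hence
    # the schedule search runs in polynomial time.
    for j in range(1, n):
        for i in range(j):
            c = best[i] + edge_cost(i, j)
            if c < best[j]:
                best[j], prev[j] = c, i
    # Walk the back-pointers to recover the optimal multi-stage path.
    path, j = [], n - 1
    while j is not None:
        path.append(sizes[j])
        j = prev[j]
    return best[-1], path[::-1]


if __name__ == "__main__":
    # Toy example: grow from 1B through optional 2B/4B stages to 7B params.
    sizes = [1, 2, 4, 7]  # billions of parameters (hypothetical grid)
    # Stand-in cost: bigger jumps cost more; a fixed per-stage overhead
    # penalizes schedules with too many intermediate stages.
    cost = lambda i, j: (sizes[j] - sizes[i]) ** 1.5 + 1.0
    total, schedule = optimal_growth_schedule(sizes, cost)
    print(total, schedule)  # minimal-cost path 1B -> ... -> 7B
```

With a real marginal-utility-based edge cost in place of the toy one, the same recurrence would select which intermediate model sizes are worth training at all, which is the schedule-selection question the abstract poses.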
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 8279