Keywords: deep learning, progressive training, model growth, optimization, feature learning
TL;DR: Progressive training that expands a zero-layer model into a multi-layer model is the most compute-efficient approach, with no loss degradation.
Abstract: Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but incur higher computational cost. To train models efficiently at scale, progressive training, an effective strategy in which model capacity scales up during training, has emerged as a way to significantly reduce computation with little to no performance degradation.
In this work, we study the depth expansion of large-scale models through the lens of optimization theory and feature learning, offering insights into the initialization of new layers, hyperparameter transfer, learning rate schedules, and the timing of model expansion. Specifically, we propose zero-layer single-stage progressive training for the optimal tradeoff between computation and loss (and accuracy). For example, zero-layer progressive training on GPT2 can save $\approx 80\%$ of the compute, or equivalently accelerate training by $5\times$, while achieving a loss comparable to that of a fully trained 60-layer model with 7B parameters.
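As a rough illustration of the depth-expansion idea described in the abstract, the minimal PyTorch sketch below trains a shallow model first and then grows it once to the target depth. All names (`ToyModel`, `Block`, `expand_depth`) are hypothetical, and zero-initializing the residual branches of new layers is one common function-preserving choice, not necessarily the initialization scheme proposed in the submission.

```python
# Hypothetical sketch of single-stage depth expansion (not the authors' exact code).
# A shallow model is trained first, then grown to the target depth; new blocks are
# zero-initialized on their residual branches so the expanded model initially
# computes the same function as the shallow one (one common choice; the paper's
# precise initialization scheme is not specified in this abstract).
import copy
import torch.nn as nn

class Block(nn.Module):
    """A toy residual block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class ToyModel(nn.Module):
    def __init__(self, dim, depth):
        super().__init__()
        self.embed = nn.Linear(dim, dim)   # stands in for the token embedding
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        self.head = nn.Linear(dim, dim)    # stands in for the output head

    def forward(self, x):
        x = self.embed(x)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)

def expand_depth(model: ToyModel, new_depth: int) -> ToyModel:
    """Grow the model to `new_depth` layers, reusing the trained embedding and head."""
    grown = copy.deepcopy(model)
    dim = model.head.in_features
    while len(grown.blocks) < new_depth:
        blk = Block(dim)
        nn.init.zeros_(blk.mlp.weight)  # zero residual branch: function-preserving
        nn.init.zeros_(blk.mlp.bias)
        grown.blocks.append(blk)
    return grown

# Usage: train a 0-layer model, then expand once and continue training.
shallow = ToyModel(dim=64, depth=0)
# ... train `shallow` for the early phase ...
deep = expand_depth(shallow, new_depth=12)
# ... continue training `deep` for the remaining budget ...
```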
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20052