Keywords: Efficient stagewise training, modular training, language model pretraining, implicit bias, simple-to-complex learning
TL;DR: We propose progressive subnetwork training for efficient pre-training of large language models, which improves downstream performance over existing stagewise pre-training methods.
Abstract: Recent developments in large language models have sparked interest in efficient pretraining methods. Stagewise training approaches that improve efficiency, such as gradual stacking and layer dropping (Reddi et al., 2023; Zhang & He, 2020), have recently garnered attention. The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective, especially when compared to stacking-based approaches. This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods. Specifically, we develop a principled stagewise training framework, progressive subnetwork training, which trains only subnetworks within the model and progressively increases their size during training, until it trains the full network. We propose an instantiation of this framework, Random Part Training (RAPTR), that selects and trains only a random subnetwork (e.g., depth-wise, width-wise) of the network at each step, progressively increasing the size in stages. We show that this approach not only generalizes prior works like layer dropping but also fixes their key issues. Furthermore, we establish a theoretical basis for such approaches and provide justification for (a) increasing the complexity of subnetworks in stages, conceptually diverging from prior works on layer dropping, and (b) stability in loss across stage transitions in the presence of key modern architecture components like residual connections and layer norms. Through comprehensive experiments, we demonstrate that RAPTR can significantly speed up training of standard models such as BERT and UL2, by up to 33% compared to standard training, and, surprisingly, also shows better downstream performance on UL2, improving QA tasks and SuperGLUE by 1.5%, thereby providing evidence of a better inductive bias.
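
To make the described procedure concrete, the following is a minimal PyTorch-style sketch of depth-wise progressive subnetwork training in the spirit of RAPTR as summarized in the abstract. All names (SubnetTransformer, sample_subnetwork, stage_schedule) and the stage sizes are illustrative assumptions rather than the authors' implementation: at each step only a random subset of layers is trained, skipped layers reduce to the identity through the residual stream, and the subset size grows across stages until the full network is trained.

    # Minimal sketch of depth-wise progressive subnetwork training (RAPTR-style).
    # SubnetTransformer, sample_subnetwork, and stage_schedule are illustrative
    # placeholders, not the authors' implementation.
    import random
    import torch
    import torch.nn as nn

    class SubnetTransformer(nn.Module):
        def __init__(self, num_layers=12, d_model=256, n_heads=4):
            super().__init__()
            self.layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                 for _ in range(num_layers)])

        def forward(self, x, active_layers):
            # Skipped layers act as the identity via the residual stream,
            # which keeps the loss stable when the subnetwork size changes.
            for i, layer in enumerate(self.layers):
                if i in active_layers:
                    x = layer(x)
            return x

    def sample_subnetwork(num_layers, subnet_size):
        # Depth-wise "random part": pick subnet_size layers uniformly at random.
        return set(random.sample(range(num_layers), subnet_size))

    # Stage schedule: the subnetwork grows until the full network is trained.
    # (steps_per_stage, subnet_size) values below are placeholders.
    stage_schedule = [(1000, 6), (1000, 9), (1000, 12)]

    model = SubnetTransformer(num_layers=12)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for steps, subnet_size in stage_schedule:
        for _ in range(steps):
            x = torch.randn(8, 64, 256)                 # stand-in input batch
            active = sample_subnetwork(12, subnet_size)
            loss = model(x, active).pow(2).mean()       # stand-in loss
            opt.zero_grad()
            loss.backward()
            opt.step()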
Supplementary Material: zip
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8011