Keywords: Efficient pre-training; Model growth; Progressive training
Verify Author List: I have double-checked the author list and understand that additions and removals will not be allowed after the submission deadline.
TL;DR: This paper studies progressive training to speed up pre-training of language models.
Abstract: The remarkable performance of GPT models has drawn widespread attention to large-scale language models. Despite their strong performance, the enormous pre-training cost is prohibitive. Progressive pre-training exploits the faster convergence of small models to save compute and shows great potential for accelerating pre-training. This work studies two key issues in progressive pre-training: the growth schedule and the growth operation. First, we derive a theoretical estimate of the optimal growth point. We then find experimentally that performing the growth operation after the model enters the convergence stage yields a high speed-up ratio. In addition, we propose progressive dimensionality growth for width expansion and redundant layers for depth expansion. Progressive dimensionality growth is a smoothed operation that improves training stability. Redundant layers implement function preservation at small cost and inherit the core parameters of adjacent layers, improving the utilization of the knowledge learned by the original model. Our method strictly preserves the original function and produces good training dynamics. Experimental results show that it outperforms the baselines and achieves a speed-up of about 1.5x while reaching the same training quality.
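The abstract describes depth expansion via redundant layers that inherit parameters from an adjacent layer while preserving the network's function. Below is a minimal, hypothetical sketch of one common way such function-preserving depth growth can be realized for a residual block; the names (`TinyBlock`, `grow_depth`) and the zero-initialized output projection are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of function-preserving depth growth via a "redundant" block.
# Assumption: blocks are residual, so zeroing the new block's output projection
# makes it an identity on the residual path at insertion time.
import copy
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """A minimal pre-norm residual block: x + proj(act(fc(norm(x))))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.proj = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.proj(self.act(self.fc(self.norm(x))))

def grow_depth(blocks: nn.ModuleList, index: int) -> nn.ModuleList:
    """Insert a redundant copy of blocks[index] right after it.

    The copy inherits the parameters of the adjacent (source) block, but its
    output projection is zeroed so it initially contributes nothing to the
    residual stream, i.e. the overall function is preserved exactly.
    """
    new_block = copy.deepcopy(blocks[index])
    nn.init.zeros_(new_block.proj.weight)
    nn.init.zeros_(new_block.proj.bias)
    grown = list(blocks[: index + 1]) + [new_block] + list(blocks[index + 1:])
    return nn.ModuleList(grown)

if __name__ == "__main__":
    torch.manual_seed(0)
    blocks = nn.ModuleList([TinyBlock(16, 64) for _ in range(2)])
    x = torch.randn(4, 16)

    def run(mods, inp):
        for m in mods:
            inp = m(inp)
        return inp

    before = run(blocks, x)
    after = run(grow_depth(blocks, index=1), x)
    # Outputs match: the inserted block starts as an identity on the residual path.
    print(torch.allclose(before, after, atol=1e-6))
```

In this sketch the inherited weights give the new layer a warm start, while the zeroed projection guarantees exact function preservation at the moment of growth; training then gradually activates the new capacity.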
A Signed Permission To Publish Form In Pdf: pdf
Supplementary Material: zip
Primary Area: Deep Learning (architectures, deep reinforcement learning, generative models, deep learning theory, etc.)
Paper Checklist Guidelines: I certify that all co-authors of this work have read and commit to adhering to the guidelines in Call for Papers.
Student Author: Yes
Submission Number: 37