Abstract: With the increasing prevalence of deep neural networks and their growing demand for more powerful hardware, understanding how model architecture parameters, hardware architecture parameters, and model and data parallelism jointly affect overall model performance (training time and accuracy) becomes ever more important for designing next-generation deep learning (DL) hardware. To aid such understanding, this work studies the effect of scaling model size on overall performance, and debunks a long-held belief that larger models must take longer to train.
We first break the total training time into the number of steps and the time per step. We analytically model the training time per step and empirically study the number of steps to convergence. We observe that larger models take fewer steps to reach their minimum validation loss (the halting point). The burden is therefore on the hardware community to improve hardware design so that, as model size scales, the growth in training time per step is slower than the decrease in the number of steps. If successful, larger models will converge faster, and therefore we can have a larger cake and eat it faster too.
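As a quick sketch of this decomposition (the notation below is ours for illustration, not necessarily the paper's): writing $T_{\text{train}}$ for total training time, $N_{\text{steps}}$ for the number of steps to the halting point, and $t_{\text{step}}$ for the time per step, each as a function of model size $s$,

$$ T_{\text{train}}(s) = N_{\text{steps}}(s) \cdot t_{\text{step}}(s), \qquad \frac{\mathrm{d}\,\ln T_{\text{train}}}{\mathrm{d}\,\ln s} \;=\; \frac{\mathrm{d}\,\ln N_{\text{steps}}}{\mathrm{d}\,\ln s} \;+\; \frac{\mathrm{d}\,\ln t_{\text{step}}}{\mathrm{d}\,\ln s}. $$

Under this reading, larger models converge faster in wall-clock time whenever the sum on the right is negative, i.e. the relative decrease in $N_{\text{steps}}$ with model size outpaces the relative increase in $t_{\text{step}}$.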