Analyzing & Eliminating Learning Rate Warmup in GPT Pre-Training

Published: 16 Jun 2024 · Last Modified: 19 Jul 2024 · HiLD at ICML 2024 Poster · License: CC BY 4.0
Keywords: Learning rate, warmup, gpt, transformer, critical batch size, rotational, angular
TL;DR: We show that small GPT models can be trained without learning rate warmup by better controlling the update size
Abstract: Learning Rate Warmup is a popular heuristic for training neural networks, which downscales early updates relative to later ones. This aids training, suggesting that the initial updates are too large in some sense, but *why and by which criteria* remains unclear. In this work we explore this question for small GPT training by assessing and controlling the update size via various metrics. We find the standard L2-norm of the updates to be insufficient, but using relative changes of either the matrix weights or the neural representations is promising for reducing or eliminating the need for explicit warmup. Quantifying the updates in representation space in particular can help withstand changes in the gradient signal-to-noise ratio or "critical batch size" throughout training, which warmup helps counteract but simpler weight-based methods fail to account for.
Student Paper: Yes
Submission Number: 63
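To make the abstract's notion of a "relative weight change" metric concrete, below is a minimal, hypothetical sketch (not the authors' exact method): after each optimizer step, the realized update of every matrix parameter is rescaled so that ||W_new - W_old||_F / ||W_old||_F stays below an assumed cap (`max_rel_change`, chosen here for illustration). This keeps early updates from being disproportionately large under a constant learning rate, which is the role warmup normally plays.

```python
# Minimal sketch, assuming a PyTorch training loop. The cap value and the
# choice to apply it only to 2-D (matrix) parameters are illustrative
# assumptions, not details taken from the paper.
import torch


def capped_update_step(params, optimizer, max_rel_change=0.01):
    """Apply optimizer.step(), then rescale each matrix update so that
    ||W_new - W_old||_F / ||W_old||_F <= max_rel_change."""
    # Snapshot matrix parameters before the optimizer modifies them.
    old = {id(p): p.detach().clone() for p in params if p.dim() == 2}
    optimizer.step()
    with torch.no_grad():
        for p in params:
            if p.dim() != 2:
                continue  # only cap matrix parameters in this weight-based variant
            w_old = old[id(p)]
            delta = p - w_old
            rel = delta.norm() / (w_old.norm() + 1e-12)
            if rel > max_rel_change:
                # Shrink the realized update so the relative change equals the cap.
                p.copy_(w_old + delta * (max_rel_change / rel))


# Toy usage: a single linear layer standing in for a GPT weight matrix,
# trained with a constant learning rate and no warmup schedule.
model = torch.nn.Linear(256, 256, bias=False)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(32, 256)
loss = model(x).pow(2).mean()
loss.backward()
capped_update_step(list(model.parameters()), opt, max_rel_change=0.01)
```

A representation-space variant, as discussed in the abstract, would instead measure the change of layer activations on a probe batch rather than of the weights themselves; the weight-based cap above is only the simpler of the two ideas.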