Abstract: We present a new meta-learning method to determine the optimal learning rate schedule
for gradient descent. It leverages training runs from a hyperparameter search to learn a
latent representation of the training process, which is modeled as a dynamical system.
Given current training metrics, it predicts the future learning rate schedule with the best
long-term validation performance. Our scheduler generalizes beyond previously observed
training dynamics and creates specialized schedules that deviate noticeably from even the
best-performing parametric functions. It outperforms all baselines we compare against
on image classification with CNN and ResNet models as well as on next-token prediction
with a transformer model. The trained models are located in flatter regions of the loss
landscape and thus provide better generalization than those trained with other schedules.
Our method is computationally efficient, optimizer-agnostic, and can easily be layered on top
of ML experiment-tracking platforms to streamline training of neural networks from scratch.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Bruno_Loureiro1
Submission Number: 8881