Keywords: deep learning optimization, learning rate tuner, progressive sharpening, edge of stability
TL;DR: We investigate the interplay between curvature dynamics and learning rate tuners through a new learning rate tuning method.
Abstract: Curvature information -- particularly, the largest eigenvalue of the loss
Hessian, known as the sharpness -- often forms the basis for learning rate
tuners. However, recent work has shown that curvature undergoes complex
dynamics during training, moving from a phase of increasing sharpness
(progressive sharpening) to eventual stabilization (the edge of stability). We analyze the closed-loop feedback effect between
learning rate tuning and curvature. We find that classical learning rate tuners
may yield a greater one-step loss reduction, yet ultimately underperform
constant learning rates over the long run in the full-batch regime. These
tuners disrupt the stabilization of the sharpness, a behavior we explain with a
simplified model of the joint dynamics of the learning rate and the curvature.
To further investigate these effects, we introduce a new learning rate tuning
method, Curvature Dynamics Aware Tuning (CDAT), which prioritizes long-term
curvature stabilization over instantaneous progress on the objective. In the
full-batch regime, CDAT behaves like a predefined warm-up schedule on deep
learning objectives, outperforming tuned constant learning rates. In the mini-batch
regime, we observe that stochasticity introduces confounding effects that
explain the previous success of some learning rate tuners at appropriate batch
sizes. Our findings highlight the critical role of understanding the joint
dynamics of the learning rate and curvature, beyond greedy minimization, to
diagnose failures and design effective adaptive learning rate tuners.
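The abstract does not spell out CDAT's update rule, but the idea of tying the learning rate to the sharpness can be illustrated with a minimal sketch. The snippet below, written in JAX, assumes a toy quadratic objective, estimates the sharpness by power iteration on Hessian-vector products, and sets the step size to a hypothetical fraction of the 2/sharpness stability threshold for gradient descent; it illustrates curvature-aware tuning in general, not the paper's actual method.

```python
import jax
import jax.numpy as jnp

# Toy quadratic objective standing in for a deep learning loss
# (assumption: the paper experiments on real networks; this is illustrative only).
A = jnp.diag(jnp.array([10.0, 1.0, 0.1]))

def loss(w):
    return 0.5 * w @ (A @ w)

def hvp(w, v):
    # Hessian-vector product via forward-over-reverse differentiation.
    return jax.jvp(jax.grad(loss), (w,), (v,))[1]

def sharpness(w, num_iters=50, seed=0):
    # Power-iteration estimate of the largest Hessian eigenvalue.
    v = jax.random.normal(jax.random.PRNGKey(seed), w.shape)
    for _ in range(num_iters):
        v = hvp(w, v)
        v = v / jnp.linalg.norm(v)
    return float(v @ hvp(w, v))

w = jnp.array([1.0, 1.0, 1.0])
for step in range(20):
    lam = sharpness(w)
    # Hypothetical rule: stay just below the 2/sharpness stability threshold
    # rather than greedily minimizing the one-step loss.
    lr = 1.8 / lam
    w = w - lr * jax.grad(loss)(w)
    print(step, float(loss(w)), lam)
```

On a quadratic the Hessian is fixed, so the estimated sharpness stays constant; on real networks the curvature moves with the parameters, which is exactly the closed-loop feedback between learning rate and curvature that the abstract argues a tuner must account for.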
Primary Area: Optimization for deep networks
Submission Number: 12764