Keywords: Edge of Stability, Optimization for deep learning, SGD, Instabilities of Training
Abstract: Recent findings by Cohen et al. demonstrate that when training neural networks with full-batch gradient descent at step size $\eta$, the largest eigenvalue $\lambda_{\max}$ of the full-batch Hessian consistently stabilizes at $2/\eta$.
These results have significant implications for convergence and generalization.
This, however, is not the case for mini-batch stochastic gradient descent (SGD), which limits the broader applicability of these results.
We show that SGD trains in a different regime, which we term the Edge of Stochastic Stability (EoSS).
In this regime, what stabilizes at $2/\eta$ is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients.
As a consequence, $\lambda_{\max}$---which is generally smaller than Batch Sharpness---is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima.
We further discuss implications for mathematical modeling of SGD trajectories.
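To make the central quantity concrete, the following is a minimal PyTorch sketch of how Batch Sharpness might be estimated for a single mini-batch. It assumes the directional curvature is normalized as $g_B^\top H_B g_B / \|g_B\|^2$, with $g_B$ and $H_B$ the mini-batch gradient and Hessian; the function and variable names are illustrative and not taken from the paper.

```python
import torch

def batch_sharpness_estimate(model, loss_fn, x, y):
    """Estimate g^T H g / ||g||^2 for one mini-batch (x, y), where g and H are
    the gradient and Hessian of the mini-batch loss at the current parameters.
    (Assumed normalization; the paper's exact definition may differ.)"""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    # Mini-batch gradient, kept in the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Hessian-vector product H g: differentiate <g, stop_grad(g)> w.r.t. params.
    dot = sum((g * g.detach()).sum() for g in grads)
    hvp = torch.autograd.grad(dot, params)
    g_sq = sum((g.detach() ** 2).sum() for g in grads)
    gHg = sum((g.detach() * h).sum() for g, h in zip(grads, hvp))
    return (gHg / g_sq).item()
```

Averaging this estimate over several mini-batches would approximate the expectation that the abstract refers to as Batch Sharpness.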
Primary Area: optimization
Submission Number: 15914