SGD Smooths The Sharpest Directions

Stanisław Jastrzębski; Zac Kenton; Nicolas Ballas; Asja Fischer; Amos Storkey; Yoshua Bengio

12 Feb 2018 (modified: 05 May 2023), ICLR 2018 Workshop Submission, Readers: Everyone

Abstract: Stochastic gradient descent (SGD) is able to find regions that generalize well, even in drastically over-parametrized models such as deep neural networks. We observe that noise in SGD controls the spectral norm and conditioning of the Hessian throughout training. We hypothesize that this phenomenon is caused by the dynamics of neurons saturating their non-linearity along the largest curvature directions, which leads to improved conditioning.

Keywords: SGD, sharpness, regularization

TL;DR: Noise in SGD smooths the loss surface by controlling the spectral norm of the Hessian.
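The spectral norm of the Hessian referred to above is its largest-magnitude eigenvalue, that is, the curvature along the sharpest direction. The following is a minimal sketch of one common way to estimate it, power iteration on finite-difference Hessian-vector products. It is not taken from the paper: the toy quadratic loss, the matrix `A`, and the function names `grad`, `hvp`, and `hessian_spectral_norm` are all assumptions made for illustration.

```python
import math
import random

# Hypothetical toy loss (an assumption, not from the paper):
# L(w) = 0.5 * w^T A w, whose Hessian is exactly A everywhere.
A = [[3.0, 1.0], [1.0, 2.0]]


def grad(w):
    """Gradient of the toy loss, i.e. the matrix-vector product A @ w."""
    n = len(w)
    return [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]


def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate the Hessian-vector product H(w) @ v by central differences of the gradient."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


def hessian_spectral_norm(grad_fn, w, iters=100, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at w by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]  # start from a random unit vector
    estimate = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:  # v lies in the null space; nothing more to learn
            break
        estimate = norm  # ||H v|| for unit v converges to |lambda_max|
        v = [x / norm for x in hv]
    return estimate


if __name__ == "__main__":
    print(hessian_spectral_norm(grad, [0.5, -0.3]))
```

For this toy Hessian the exact answer is (5 + sqrt(5)) / 2, about 3.618, so the estimate can be checked directly. In a real network the gradient function would come from automatic differentiation, and tracking this quantity over training steps is one way to observe the effect the abstract describes.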