SGD Smooths The Sharpest Directions

Stanisław Jastrzębski, Zac Kenton, Nicolas Ballas, Asja Fischer, Amos Storkey, Yoshua Bengio

Feb 12, 2018 · ICLR 2018 Workshop Submission
  • Abstract: Stochastic gradient descent (SGD) is able to find regions that generalize well, even in drastically over-parameterized models such as deep neural networks. We observe that the noise in SGD controls the spectral norm and the conditioning of the Hessian throughout training. We hypothesize that this phenomenon is caused by neurons saturating their non-linearities along the directions of largest curvature, which improves conditioning.
  • TL;DR: Noise in SGD smooths the loss surface by controlling the spectral norm of the Hessian (a sketch of how this quantity can be estimated is given after the keywords).
  • Keywords: SGD, sharpness, regularization
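
The abstract's central quantity is the spectral norm of the loss Hessian, i.e. its largest-magnitude eigenvalue. Below is a minimal sketch, not taken from the paper, of how this quantity could be tracked during training: power iteration on autograd Hessian-vector products in PyTorch. The model, data, iteration count, and the function name hessian_spectral_norm are illustrative placeholders, not the authors' setup.

    # Minimal sketch (illustrative, not the authors' code): estimate the spectral
    # norm of the loss Hessian with power iteration on Hessian-vector products.
    import torch
    import torch.nn as nn

    def hessian_spectral_norm(model, loss, n_iters=20):
        """Power iteration using autograd Hessian-vector products."""
        params = [p for p in model.parameters() if p.requires_grad]
        # First-order gradients, kept in the graph so we can differentiate again.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Random unit vector with the same shapes as the parameters.
        v = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((x ** 2).sum() for x in v))
        v = [x / norm for x in v]
        eig = 0.0
        for _ in range(n_iters):
            # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
            gv = sum((g * x).sum() for g, x in zip(grads, v))
            hv = torch.autograd.grad(gv, params, retain_graph=True)
            # Rayleigh quotient with the current unit vector.
            eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
            norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
            v = [h / (norm + 1e-12) for h in hv]
        return abs(eig)

    # Illustrative usage on a toy model and batch (placeholders).
    model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    loss = nn.functional.mse_loss(model(x), y)
    print(hessian_spectral_norm(model, loss))

Running such an estimate at checkpoints along SGD trajectories with different batch sizes (and hence different gradient noise levels) is one way to probe the abstract's claim that SGD noise controls the Hessian's spectral norm and conditioning over the course of training.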