SGD Smooths The Sharpest Directions

Stanisław Jastrzębski, Zac Kenton, Nicolas Ballas, Asja Fischer, Amos Storkey, Yoshua Bengio

Feb 12, 2018 (modified: Jun 04, 2018) ICLR 2018 Workshop Submission
  • Abstract: Stochastic gradient descent (SGD) is able to find regions that generalize well, even in drastically over-parametrized models such as deep neural networks. We observe that noise in SGD controls the spectral norm and conditioning of the Hessian throughout training. We hypothesize that this phenomenon arises from neurons saturating their non-linearities along the directions of largest curvature, which leads to improved conditioning.
  • Keywords: SGD, sharpness, regularization
  • TL;DR: Noise in SGD smooths the loss surface by controlling the spectral norm of the Hessian.
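
The quantity the abstract refers to, the spectral norm of the Hessian, can be estimated without forming the Hessian explicitly. The following is a minimal sketch, not the authors' procedure: it assumes access to a gradient function and uses power iteration with finite-difference Hessian-vector products. The names `hessian_vector_product`, `hessian_spectral_norm`, and the toy quadratic loss are hypothetical and chosen only for illustration.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Central finite-difference approximation of H(w) @ v."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def hessian_spectral_norm(grad_fn, w, num_iters=100, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eigenvalue = float(v @ hv)      # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:                 # v lies in the null space; stop early
            break
        v = hv / norm                   # renormalize toward the sharpest direction
    return abs(eigenvalue)

# Toy example (an assumption for illustration): loss(w) = 0.5 * w^T A w,
# whose gradient is A @ w and whose Hessian is A itself.
A = np.diag([1.0, 2.0, 10.0])
grad_fn = lambda w: A @ w
w0 = np.ones(3)
sharpness = hessian_spectral_norm(grad_fn, w0)
print(f"estimated spectral norm: {sharpness:.4f}")
```

For a real network, `grad_fn` would be replaced by the gradient of the training loss with respect to the flattened parameters, and tracking the returned value over training steps gives one way to monitor curvature along the sharpest direction.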