SGD Smooths The Sharpest Directions
Stanisław Jastrzębski, Zac Kenton, Nicolas Ballas, Asja Fischer, Amos Storkey, Yoshua Bengio
Feb 12, 2018 (modified: Jun 04, 2018), ICLR 2018 Workshop Submission, readers: everyone
Abstract: Stochastic gradient descent (SGD) is able to find regions that generalize well, even in drastically over-parametrized models such as deep neural networks. We observe that noise in SGD controls the spectral norm and conditioning of the Hessian throughout training. We hypothesize that this phenomenon is caused by the dynamics of neurons saturating their non-linearity along the largest curvature directions, which leads to improved conditioning.
Keywords: SGD, sharpness, regularization
TL;DR: Noise in SGD smooths the loss surface by controlling the spectral norm of the Hessian.
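For readers unfamiliar with the quantity the abstract refers to, the following is a minimal sketch of one common way to estimate the spectral norm of a Hessian: power iteration on Hessian-vector products. It is not taken from the paper. The function names `hvp` and `hessian_spectral_norm`, the finite-difference approximation, and the toy quadratic loss are all assumptions made for illustration.

```python
import math


def hvp(grad_fn, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H(w) v by central differences
    of the gradient: (g(w + eps*v) - g(w - eps*v)) / (2*eps)."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus = grad_fn(w_plus)
    g_minus = grad_fn(w_minus)
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]


def hessian_spectral_norm(grad_fn, w, iters=100):
    """Estimate the largest absolute eigenvalue of the (symmetric) Hessian
    at w using power iteration; no explicit Hessian is ever formed."""
    n = len(w)
    v = [1.0 / math.sqrt(n)] * n  # deterministic unit-norm start vector
    estimate = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            return 0.0  # v lies in the null space of the Hessian
        estimate = norm  # ||H v|| with ||v|| = 1 approaches |lambda_max|
        v = [x / norm for x in hv]
    return estimate


# Hypothetical toy loss L(w) = 0.5 * (10 * w0^2 + w1^2), whose Hessian is
# diag(10, 1), so the sharpest direction has curvature 10.
def toy_grad(w):
    return [10.0 * w[0], 1.0 * w[1]]


sharpness = hessian_spectral_norm(toy_grad, [0.3, -0.7])
print(sharpness)  # close to 10 for this toy loss
```

In a real network the gradient function would come from automatic differentiation, and the Hessian-vector product would usually be computed exactly rather than by finite differences. Tracking this estimate over training steps is one way to observe how the sharpest direction evolves.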