Abstract: Stochastic gradient descent (SGD) is able to find regions that generalize well, even in drastically over-parametrized models such as deep neural networks. We observe that noise in SGD controls the spectral norm and the conditioning of the Hessian throughout training. We hypothesize that this phenomenon is caused by the dynamics of neurons saturating their non-linearities along the largest-curvature directions, thus leading to improved conditioning.
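The quantity the abstract refers to, the spectral norm of the Hessian of the loss, is typically tracked with power iteration on Hessian-vector products rather than by forming the Hessian explicitly. The sketch below is an illustrative PyTorch implementation under that assumption, not the paper's own measurement code; the function name and iteration count are hypothetical.

```python
# Illustrative sketch (not from the paper): estimating the Hessian spectral
# norm of a loss via power iteration on Hessian-vector products in PyTorch.
import torch

def hessian_spectral_norm(loss, params, iters=20):
    """Approximate the largest absolute Hessian eigenvalue (spectral norm)
    of `loss` w.r.t. `params` using power iteration with HVPs."""
    params = list(params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit start vector with the same shapes as the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]
    eig = torch.tensor(0.0)
    for _ in range(iters):
        # Hessian-vector product: differentiate <grad, v> w.r.t. params.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * u).sum() for h, u in zip(hv, v))  # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    return eig.abs().item()

# Usage (hypothetical): compute `loss` on a batch, then
#   lam_max = hessian_spectral_norm(loss, model.parameters())
# Logging lam_max over training is one way to monitor Hessian sharpness.
```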
Keywords: SGD, sharpness, regularization
TL;DR: Noise in SGD smooths the loss surface by controlling the spectral norm of the Hessian