Keywords: deep learning, residual networks, exploding gradient, non-convex optimization
TL;DR: the long-held idea that batch normalization solves the gradient explosion problem while skip connections is not working on this is wrong.
Abstract: The exploding gradient problem was one of the main barriers to training deep neural networks.
It is widely believed that this problem can be solved by techniques like careful weight initialization and normalization layers.
However, we find that exploding gradients still exist in deep plain networks eve with these techniques applied.
Our theory shows that the nonlinearity of activation layers are the
source of such exploding gradients, and the gradient of a plain network increases exponentially with the number of nonlinear layers.
Submission Number: 90
Loading