Optimization Theory for ReLU Neural Networks Trained with Normalization Layers

31 Aug 2020
Abstract: The success of deep neural networks is in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice, as they improve generalization performance and speed up training significantly. Nonetheless, the vast majority of current deep learning theory and non-convex optimization literature focuses on the un-normalized setting, where the functions under consideration do not exhibit the properties of commonly normalized neural networks. In this paper, we bridge this gap by giving the first global convergence result for two-layer neural networks with ReLU activations trained with a normalization layer, namely Weight Normalization. Our analysis shows how the introduction of normalization layers changes the optimization landscape and can enable faster convergence as compared with un-normalized neural networks.
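
For readers unfamiliar with the setting, the sketch below illustrates the kind of architecture the abstract describes: a two-layer ReLU network whose hidden-layer weights are reparameterized by Weight Normalization, i.e. each weight vector is written as w = g · v / ||v||, decoupling its magnitude g from its direction v. This is a generic PyTorch illustration under assumed choices (the class name, dimensions, initialization, and learning rate are illustrative), not the exact parameterization or training regime analyzed in the paper.

```python
import torch
import torch.nn.functional as F


class WeightNormTwoLayerReLU(torch.nn.Module):
    """Two-layer ReLU network with Weight Normalization on the hidden layer.

    Each hidden weight vector is reparameterized as w_j = g_j * v_j / ||v_j||,
    so the magnitude g_j and direction v_j are optimized separately.
    (Illustrative sketch only; details may differ from the paper's setup.)
    """

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.v = torch.nn.Parameter(torch.randn(hidden_dim, in_dim))  # directions
        self.g = torch.nn.Parameter(torch.ones(hidden_dim))           # magnitudes
        self.a = torch.nn.Parameter(torch.randn(hidden_dim) / hidden_dim ** 0.5)  # output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Weight Normalization: scale the unit direction v / ||v|| by g.
        w = self.g.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        hidden = torch.relu(F.linear(x, w))  # ReLU hidden layer
        return hidden @ self.a               # scalar output per example


if __name__ == "__main__":
    # Toy usage: fit random data with plain gradient descent.
    torch.manual_seed(0)
    x, y = torch.randn(64, 5), torch.randn(64)
    model = WeightNormTwoLayerReLU(in_dim=5, hidden_dim=100)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for step in range(200):
        opt.zero_grad()
        loss = F.mse_loss(model(x), y)
        loss.backward()
        opt.step()
```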