On the Dynamics and Convergence of Weight Normalization for Training Neural NetworksDownload PDF

25 Sep 2019 (modified: 24 Dec 2019)ICLR 2020 Conference Blind SubmissionReaders: Everyone
  • Original Pdf: pdf
  • Keywords: Normalization methods, Weight Normalization, Convergence Theory
  • TL;DR: We prove ReLU networks trained with weight normalization converge and analyze distinct behavior of different convergence regimes.
  • Abstract: We present a proof of convergence for ReLU networks trained with weight normalization. In the analysis, we consider over-parameterized 2-layer ReLU networks initialized at random and trained with batch gradient descent and a fixed step size. The proof builds on recent theoretical works that bound the trajectory of parameters from their initialization and monitor the network predictions via the evolution of a ''neural tangent kernel'' (Jacot et al. 2018). We discover that training with weight normalization decomposes such a kernel via the so called ''length-direction decoupling''. This in turn leads to two convergence regimes and can rigorously explain the utility of WeightNorm. From the modified convergence we make a few curious observations including a natural form of ''lazy training'' where the direction of each weight vector remains stationary.
8 Replies