Abstract: Despite its empirical success, the theoretical underpinnings of the stability, convergence and acceleration properties of batch normalization (BN) remain elusive. In this paper, we attack this problem from a modelling approach, where we perform thorough theoretical analysis on BN applied to simplified model: ordinary least squares (OLS). We discover that gradient descent on OLS with BN has interesting properties, including a scaling law, convergence for arbitrary learning rates for the weights, asymptotic acceleration effects, as well as insensitivity to choice of learning rates. We then demonstrate numerically that these findings are not specific to the OLS problem and hold qualitatively for more complex supervised learning problems. This points to a new direction towards uncovering the mathematical principles that underlies batch normalization.
Keywords: Batch normalization, Convergence analysis, Gradient descent, Ordinary least squares, Deep neural network
TL;DR: We mathematically analyze the effect of batch normalization on a simple model and obtain key new insights that applies to general supervised learning.
Data: [CIFAR-10](https://paperswithcode.com/dataset/cifar-10), [Fashion-MNIST](https://paperswithcode.com/dataset/fashion-mnist), [MNIST](https://paperswithcode.com/dataset/mnist)
13 Replies
Loading