Normalization Gradients are Least-squares Residuals

27 Sept 2018 (modified: 05 May 2023) · ICLR 2019 Conference Blind Submission · Readers: Everyone
Abstract: Batch Normalization (BN) and its variants have seen widespread adoption in the deep learning community because they improve the training of deep neural networks. Why this normalization works so well remains an unsettled question. We make explicit the relationship between ordinary least squares and the partial derivatives computed when back-propagating through BN. We recast the back-propagation of BN as a least squares fit, which zero-centers and decorrelates partial derivatives from normalized activations. This view, which we term {\em gradient-least-squares}, is an extensible and arithmetically accurate description of BN. To further explore this perspective, we motivate, interpret, and evaluate two adjustments to BN.
Keywords: Deep Learning, Normalization, Least squares, Gradient regression
TL;DR: Gaussian normalization performs a least-squares fit during back-propagation, which zero-centers and decorrelates partial derivatives from normalized activations.
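The claim in the abstract and TL;DR can be checked numerically. The sketch below (assuming PyTorch; it is not part of the submission's code) back-propagates an arbitrary upstream gradient through a single-feature batch normalization and compares the resulting input gradient against the residual of an ordinary least-squares fit of that upstream gradient onto the normalized activations, scaled by the inverse standard deviation.

```python
# Minimal sketch, assuming PyTorch: verify that the BN input gradient equals the
# OLS residual of the upstream gradient regressed onto the normalized activations,
# divided by the batch standard deviation.
import torch

torch.manual_seed(0)
N, eps = 256, 1e-12                      # batch size; tiny eps so the match is near-exact
x = torch.randn(N, requires_grad=True)   # pre-normalization activations (one feature)
g = torch.randn(N)                       # arbitrary upstream gradient dL/dxhat

# Forward pass of the normalization step (statistics taken over the batch).
mu = x.mean()
var = x.var(unbiased=False)
std = (var + eps).sqrt()
xhat = (x - mu) / std

# Back-propagate the upstream gradient g through the normalization.
xhat.backward(g)
grad_bn = x.grad                         # dL/dx computed by autograd

# Ordinary least squares over the batch: fit g ~ a + b * xhat.
b = ((g - g.mean()) * (xhat - xhat.mean())).sum() / ((xhat - xhat.mean()) ** 2).sum()
a = g.mean() - b * xhat.mean()
residual = g - (a + b * xhat)            # least-squares residual

# Gradient-least-squares claim: dL/dx = residual / std (up to the eps in the variance).
print(torch.allclose(grad_bn, residual / std, atol=1e-5))   # expected: True
```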