Abstract: Layer normalization (LayerNorm) is a technique for normalizing the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it remains unclear where its effectiveness stems from. In this paper, we take a step further towards understanding LayerNorm. By analyzing the gradients of LayerNorm, we propose that forward normalization is not the only factor behind its success: the derivatives of the mean and variance also contribute, by re-centering and re-scaling the backward gradients. Furthermore, we find that the parameters of LayerNorm, namely the bias and gain, are not always beneficial in practice because they increase the risk of over-fitting. We speculate that this over-fitting arises because the bias and gain are learned from the training set and cannot adapt to different input distributions at test time. Motivated by this assumption, we propose a novel normalization method called Adaptive Layer Normalization (AdaNorm). Specifically, AdaNorm replaces the bias and gain with a new transformation function that adaptively computes scaling weights from the input. Experiments on seven datasets demonstrate that AdaNorm yields larger improvements than LayerNorm. Our analysis also shows that AdaNorm alleviates the over-fitting problem and converges better during training.
Code Link: https://github.com/lancopku/AdaNorm
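Below is a minimal PyTorch sketch of the idea described in the abstract: normalize the input, then rescale each element with a weight computed from the normalized value itself instead of applying a learned bias and gain. The specific transformation form, the constants `k` and `scale`, and the gradient detachment are assumptions for illustration and are not stated in the abstract; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaNormSketch(nn.Module):
    """Sketch of adaptive layer normalization (assumed form, not the official code)."""

    def __init__(self, eps=1e-5, k=0.1, scale=1.0):
        super().__init__()
        self.eps = eps      # numerical stability term
        self.k = k          # assumed constant controlling how strongly large values are down-weighted
        self.scale = scale  # assumed constant keeping outputs on a comparable scale

    def forward(self, x):
        # Standard forward normalization: re-center and re-scale along the last dimension.
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        y = (x - mean) / (std + self.eps)

        # Adaptive scaling weights computed from the normalized input; detached so they
        # act as input-dependent constants during back-propagation (an assumption here).
        weight = self.scale * (1.0 - self.k * y)
        return weight.detach() * y


# Usage: normalize hidden states of shape (batch, seq_len, hidden).
x = torch.randn(2, 5, 8)
out = AdaNormSketch()(x)
print(out.shape)  # torch.Size([2, 5, 8])
```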
CMT Num: 2446