Abstract: Layer normalization (LayerNorm) is a technique for normalizing the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it remains unclear where its effectiveness stems from. In this paper, we take a step further towards understanding LayerNorm. By analyzing the gradients of LayerNorm, we propose that forward normalization is not the only factor behind its success: the derivatives of the mean and variance also contribute, by re-centering and re-scaling the backward gradients. Furthermore, we find that the parameters of LayerNorm, namely the bias and gain, are not always beneficial in practice because they increase the risk of over-fitting. We speculate that this over-fitting arises because the bias and gain are learned from the training set and cannot adapt to different input distributions at test time. Motivated by this assumption, we propose a novel normalization method called Adaptive Layer Normalization (AdaNorm). Specifically, AdaNorm replaces the bias and gain with a new transformation function that adaptively computes scaling weights from the input. Experiments on seven datasets demonstrate that AdaNorm yields larger improvements than LayerNorm. Our analysis also shows that AdaNorm alleviates the over-fitting problem and converges better during training.
Code Link: https://github.com/lancopku/AdaNorm
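Below is a minimal PyTorch sketch of the idea described in the abstract: normalize the input, then rescale each element with a weight computed from the normalized value itself instead of applying a learned bias and gain. The specific transformation form, the constants `k` and `scale`, and the gradient detachment are assumptions for illustration and are not stated in the abstract; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaNormSketch(nn.Module):
    """Sketch of adaptive layer normalization (assumed form, not the official code)."""

    def __init__(self, eps=1e-5, k=0.1, scale=1.0):
        super().__init__()
        self.eps = eps      # numerical stability term
        self.k = k          # assumed constant controlling how strongly large values are down-weighted
        self.scale = scale  # assumed constant keeping outputs on a comparable scale

    def forward(self, x):
        # Standard forward normalization: re-center and re-scale along the last dimension.
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        y = (x - mean) / (std + self.eps)

        # Adaptive scaling weights computed from the normalized input; detached so they
        # act as input-dependent constants during back-propagation (an assumption here).
        weight = self.scale * (1.0 - self.k * y)
        return weight.detach() * y


# Usage: normalize hidden states of shape (batch, seq_len, hidden).
x = torch.randn(2, 5, 8)
out = AdaNormSketch()(x)
print(out.shape)  # torch.Size([2, 5, 8])
```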
CMT Num: 2446