Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

Kaifeng Lyu; Zhiyuan Li; Sanjeev Arora

Published: 31 Oct 2022, Last Modified: 06 Apr 2025, NeurIPS 2022 Accept, Readers: Everyone

Keywords: normalization, sharpness, gradient descent, scale-invariance, theoretical analysis, edge of stability

TL;DR: We give mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight decay) encourages GD to persistently reduce the sharpness of the loss surface.

Abstract: Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight decay) encourages GD to reduce the sharpness of the loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
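The scale-invariance mentioned in the abstract can be illustrated with a small numerical sketch. This is not the paper's code or experimental setup: the toy model (a single linear map followed by batch normalization without affine parameters or epsilon, with a squared loss) and every name in it (`normalized_loss`, `numerical_grad`, `X`, `y`, `w`, `c`) are assumptions made for illustration. It checks three standard facts about a scale-invariant loss: rescaling the weights by c leaves the loss unchanged, shrinks the gradient by a factor of 1/c, and the gradient has no component along the weight vector. Since curvature likewise scales with the weight norm, a raw Hessian-based sharpness would depend on that norm, which is why the definition needs care.

```python
import math
import random

def normalized_loss(w, X, y):
    """Squared loss on batch-normalized pre-activations (toy model, an assumption)."""
    z = [sum(xi * wi for xi, wi in zip(row, w)) for row in X]
    mean = sum(z) / len(z)
    var = sum((zi - mean) ** 2 for zi in z) / len(z)
    std = math.sqrt(var)  # no epsilon, so the loss is exactly scale-invariant in w
    return sum(((zi - mean) / std - yi) ** 2 for zi, yi in zip(z, y)) / len(z)

def numerical_grad(f, w, h=1e-6):
    """Central-difference gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(32)]  # synthetic inputs
y = [random.gauss(0, 1) for _ in range(32)]                      # synthetic targets
w = [random.gauss(0, 1) for _ in range(5)]
c = 3.0
cw = [c * wi for wi in w]

f = lambda v: normalized_loss(v, X, y)
loss_w, loss_cw = f(w), f(cw)
grad_w, grad_cw = numerical_grad(f, w), numerical_grad(f, cw)

# Component of the gradient along w; zero for a scale-invariant loss.
radial = sum(g * wi for g, wi in zip(grad_w, w))

print("loss(w) - loss(c*w):", loss_w - loss_cw)
print("max |grad(c*w) - grad(w)/c|:",
      max(abs(gc - g / c) for gc, g in zip(grad_cw, grad_w)))
print("<grad(w), w>:", radial)
```

All three printed quantities should be zero up to floating-point and finite-difference error.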

Supplementary Material: zip

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/understanding-the-generalization-benefit-of/code)
