Where You Place the Norm Matters: From Prejudiced to Neutral Initializations

Published: 03 Feb 2026, Last Modified: 02 May 2026
Venue: AISTATS 2026 Poster
License: CC BY 4.0
TL;DR: Normalization type and position shape prediction behavior at initialization; our theory shows how BatchNorm and LayerNorm differ fundamentally, and highlights the critical role of normalization placement within layers.
Abstract: Normalization layers were introduced to stabilize and accelerate training, yet their influence is already critical at initialization, where they shape signal propagation and output statistics before parameters adapt to data. In practice, both which normalization to use and where to place it are often chosen heuristically, even though these decisions can qualitatively alter a model’s behavior. We provide a theoretical characterization of how normalization choice and placement (Pre-Norm vs. Post-Norm) determine the distribution of class predictions at initialization, ranging from unbiased (Neutral) to highly concentrated (Prejudiced) regimes. We show that these architectural decisions induce systematic shifts in the initial prediction regime, thereby modulating subsequent learning dynamics. By linking normalization design directly to prediction statistics at initialization, our results offer principled guidance for more controlled and interpretable network design, clarifying how widely used choices such as BatchNorm vs. LayerNorm and Pre-Norm vs. Post-Norm shape behavior from the outset of training.
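To make the two architectural choices concrete, the sketch below is a minimal, hypothetical illustration (not the authors' code): a Pre-Norm block normalizes its input before the affine transform, while a Post-Norm block normalizes after it. We then sample class predictions from a freshly initialized network and inspect how concentrated they are across classes, a near-uniform histogram corresponding to the Neutral regime and a spiked one to the Prejudiced regime. The `NormBlock` and `prediction_histogram` helpers, the layer sizes, depth, and the Gaussian input model are all illustrative assumptions; only standard PyTorch APIs are used.

```python
import torch
import torch.nn as nn

class NormBlock(nn.Module):
    """One fully connected block with a configurable norm placement (illustrative)."""
    def __init__(self, dim, norm, placement="pre"):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.norm = norm(dim)          # e.g. nn.BatchNorm1d(dim) or nn.LayerNorm(dim)
        self.act = nn.ReLU()
        self.placement = placement

    def forward(self, x):
        if self.placement == "pre":    # Pre-Norm: normalize before the affine map
            return self.act(self.linear(self.norm(x)))
        else:                          # Post-Norm: normalize after the affine map
            return self.act(self.norm(self.linear(x)))

def prediction_histogram(norm, placement, dim=512, depth=10, classes=10, n=4096):
    """Fraction of random inputs assigned to each class at initialization."""
    torch.manual_seed(0)
    blocks = [NormBlock(dim, norm, placement) for _ in range(depth)]
    net = nn.Sequential(*blocks, nn.Linear(dim, classes))
    with torch.no_grad():
        preds = net(torch.randn(n, dim)).argmax(dim=1)
    return torch.bincount(preds, minlength=classes).float() / n

# Compare placements for BatchNorm and LayerNorm; the max class share gauges
# how Prejudiced (concentrated) the initial prediction distribution is.
for norm in (nn.BatchNorm1d, nn.LayerNorm):
    for placement in ("pre", "post"):
        hist = prediction_histogram(norm, placement)
        print(f"{norm.__name__:12s} {placement:4s} max class share: {hist.max():.2f}")
```

Under these assumptions, sweeping the norm type and placement in this way is one simple empirical probe of the regimes the paper characterizes theoretically.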
Code Dataset Promise: Yes
Code Dataset Url: https://github.com/EmanueleFrancazi/IGB-and-Normalization
Signed Copyright Form: pdf
Format Confirmation: I agree that I have read and followed the formatting instructions for the camera ready version.
Submission Number: 95