Keywords: normalization, layer norm, regularization, neural network training dynamics.
Abstract: Post-LayerNorm Transformers have seen limited practical adoption due to enduring difficulties in scaling them to large depths. While prior research has focused on stabilising Post-LayerNorm by improving conditioning at initialisation, stability often deteriorates when models are trained with large learning rates, forcing additional compromises. In the decoder-only settings we study, existing Post-LayerNorm methods fail to outperform strong Pre-LayerNorm baselines. We propose KiteNorm, a novel normalisation method designed to break this trend. KiteNorm achieves stability through residual scaling and a regularisation technique based on hidden-state variance. Beyond stabilisation, KiteNorm learns separate scales for the skip and residual branches in each sublayer, improving performance. Across decoder-only Transformers up to 1B parameters, KiteNorm remains stable throughout training and consistently outperforms leading baselines, with scaling laws favouring KiteNorm across depth, width, batch size, and training budget. Ablations show that each component is necessary for the full stability and performance gains. We offer theoretical insights into KiteNorm through a new perspective on the gradient vanishing problem in Post-LayerNorm, linking the underlying hidden-state expansion to rank collapse.
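Illustrative sketch (not the authors' implementation): the abstract names three ingredients, namely separate scales for the skip and residual branches, normalisation applied after the residual sum (Post-LayerNorm), and a regulariser based on hidden-state variance. The Python below shows one way these could fit together for a single token vector. The names `post_ln_sublayer`, `skip_scale`, `residual_scale`, and `variance_penalty` are hypothetical, and the squared-deviation form of the penalty is an assumption, since the abstract does not give the actual formulation.

```python
import math
from statistics import fmean, pvariance


def layer_norm(h, eps=1e-5):
    """Plain LayerNorm over one hidden vector (no learned gain or bias)."""
    mu = fmean(h)
    var = pvariance(h, mu)
    return [(v - mu) / math.sqrt(var + eps) for v in h]


def post_ln_sublayer(x, sublayer, skip_scale, residual_scale):
    """Post-LayerNorm sublayer with separate skip and residual scales.

    ASSUMPTION: in a real model the two scales would be learned parameters;
    here they are plain floats passed in by the caller.
    """
    branch = sublayer(x)
    # Scale the skip path and the residual branch independently, then sum.
    pre_norm = [skip_scale * a + residual_scale * b for a, b in zip(x, branch)]
    # ASSUMPTION: hypothetical regulariser that penalises pre-normalisation
    # variance drifting away from 1. The paper's actual penalty may differ.
    variance_penalty = (pvariance(pre_norm) - 1.0) ** 2
    # Post-LayerNorm: normalise after the residual sum.
    return layer_norm(pre_norm), variance_penalty


# Toy usage: an elementwise tanh stands in for attention or an MLP.
x = [0.5, -1.0, 2.0, 0.25]
out, penalty = post_ln_sublayer(
    x,
    lambda h: [math.tanh(v) for v in h],
    skip_scale=1.0,
    residual_scale=0.5,
)
```

In this sketch the penalty would be added to the training loss with some weight, so that hidden-state variance before normalisation is discouraged from growing with depth. That weight, and whether the penalty is applied per sublayer, are likewise left unspecified here.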
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 125