NormFormer: Improved Transformer Pretraining with Extra Normalization

Published: 28 Jan 2022, Last Modified: 22 Oct 2023 | ICLR 2022 Submission
Keywords: Language Modeling, NLP, Transformer, Zero Shot Learning
Abstract: During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers, while the optimal weighting of residuals is larger at earlier layers than at later ones. These issues can be alleviated by adding two normalization operations and two new scaling operations inside each layer. The extra operations incur negligible compute cost (+0.5% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models of multiple sizes. Adding NormFormer on top of the GPT3-Medium architecture can reach the state-of-the-art perplexity 22% faster, or converge to a perplexity 0.33 lower in the same compute budget. This results in significantly stronger zero-shot performance. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average.
One-sentence Summary: Gradients in transformer language models are too big at early layers and too small at late layers; fixing this mismatch yields substantial improvements over GPT3 and roberta-base.
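
The abstract only names the extra operations, so below is a minimal PyTorch sketch of how they could be arranged in a Pre-LN causal decoder layer, based solely on the description above: two extra LayerNorms (one on the self-attention output, one after the first feed-forward projection's activation) and two learned scalings (one per attention head, one on the feed-forward residual). The class and attribute names here are illustrative, not the authors', and the exact placement of the residual scale in particular is an assumption; treat this as a sketch rather than the reference implementation.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class NormFormerDecoderLayer(nn.Module):
    """Pre-LN causal transformer layer with NormFormer-style extras:
    two additional LayerNorms (after self-attention and after the first
    feed-forward projection's activation) and two learned scalings
    (per-head scaling of attention outputs, a scale on the FFN residual).
    Names and placements are a sketch based on the abstract, not the
    authors' code."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ffn: int = 3072):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads

        # Standard Pre-LN components
        self.attn_ln_in = nn.LayerNorm(d_model)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.ffn_ln_in = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.fc2 = nn.Linear(d_ffn, d_model)

        # NormFormer additions (scaling parameters initialised to ones)
        self.head_scale = nn.Parameter(torch.ones(n_heads))    # per-head scaling of attention outputs
        self.attn_ln_out = nn.LayerNorm(d_model)                # LayerNorm after self-attention
        self.ffn_ln_mid = nn.LayerNorm(d_ffn)                   # LayerNorm after first FFN projection + GELU
        self.resid_scale = nn.Parameter(torch.ones(d_model))    # learned scale on the FFN residual (assumed placement)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape

        # --- self-attention sub-layer ---
        h = self.attn_ln_in(x)
        q = self.q_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        heads = scores.softmax(dim=-1) @ v                       # (b, n_heads, t, d_head)
        heads = heads * self.head_scale.view(1, -1, 1, 1)        # head-wise scaling, before output projection
        h = self.out_proj(heads.transpose(1, 2).reshape(b, t, d))
        x = x + self.attn_ln_out(h)                              # extra post-attention LayerNorm

        # --- feed-forward sub-layer ---
        h = self.ffn_ln_in(x)
        h = self.ffn_ln_mid(F.gelu(self.fc1(h)))                 # extra LayerNorm after first FC + GELU
        h = self.fc2(h)
        return self.resid_scale * x + h                          # scaled residual connection


# Quick shape check (hypothetical usage):
layer = NormFormerDecoderLayer()
out = layer(torch.randn(2, 16, 768))   # (batch, seq, d_model)
```

With the scaling parameters initialised to ones, the layer starts out close to a standard Pre-LN layer, and the added LayerNorms and scales contribute only a handful of parameters each, consistent with the roughly +0.5% parameter overhead quoted in the abstract.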
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2110.09456/code)