Keywords: LLM, PreNorm, PostNorm, Layer Normalization, Architecture, Foundation Models
TL;DR: We propose FuseNorm, a normalization scheme for Transformers that achieves PreNorm's stability and PostNorm's performance without compromise.
Abstract: The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, which entails a fundamental trade-off: the PreNorm architecture ensures training stability at the cost of potential performance degradation in deep models, while the PostNorm architecture offers strong performance but suffers from severe training instability. In this work, we propose FuseNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. FuseNorm adopts the clean residual path of PreNorm to stabilize signal propagation while employing a PostNorm-style computation that normalizes the output of the residual connection, thereby enhancing model performance. We provide a theoretical analysis demonstrating that FuseNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models while alleviating the representation collapse of PreNorm. Empirically, FuseNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) settings, paving the way for more powerful and stable Transformer architectures.
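The abstract's contrast between the three schemes can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: `layer_norm`, the block functions, and the exact way FuseNorm combines the two paradigms are assumptions based only on the abstract's description (PreNorm's clean residual path plus a PostNorm-style normalization of the residual connection's output); the paper's scaling strategy is omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last axis to zero mean and (approximately) unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # PreNorm: the sublayer sees a normalized input; the residual
    # path itself is left untouched (stable signal propagation).
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # PostNorm: the residual sum is normalized, which helps
    # performance but can destabilize deep networks.
    return layer_norm(x + sublayer(x))

def fusenorm_block(x, sublayer):
    # FuseNorm (one plausible reading of the abstract): keep the
    # clean PreNorm residual stream, then apply a PostNorm-style
    # normalization to the output of the residual connection.
    residual = x + sublayer(layer_norm(x))  # PreNorm-style residual path
    return layer_norm(residual)             # PostNorm-style output normalization

# Tiny smoke test with a linear "sublayer".
x = np.random.default_rng(0).normal(size=(2, 8))
f = lambda h: 0.5 * h
y = fusenorm_block(x, f)
```

Under this reading, the output of every block is renormalized (as in PostNorm), so its per-token variance stays bounded, while gradients can still flow through the unnormalized residual sum (as in PreNorm).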
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12055