Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Layer Normalization, Large Language Models, Pretraining
TL;DR: Pre-LN destabilizes deep LLM training and limits scaling benefits. We introduce BHyT, which bounds Tanh inputs to ensure stable gradients, enabling faster and more robust pretraining as well as better supervised fine-tuning.
Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, LN and Pre-LN are inefficient due to repeated statistical calculations and suffer from the curse of depth: as layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented, normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT achieves improved stability and efficiency in pretraining, delivering on average 7.7\% faster forward computation and up to 5\% higher token generation throughput than RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: \url{https://anonymous.4open.science/r/BHyT}
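To illustrate the core idea, here is a minimal sketch of a bounded-tanh transform. This is not the paper's implementation: the RMS-based rescaling rule, the bound value `bound=3.0`, and the `gamma`/`beta` parameters are all assumptions chosen to show how input bounding can keep tanh out of its saturating regime.

```python
import numpy as np

def bhyt_sketch(x, bound=3.0, gamma=1.0, beta=0.0, eps=1e-6):
    """Hypothetical bounded-tanh step: rescale the hidden state so its
    per-token RMS does not exceed `bound`, then apply tanh.
    Inputs already inside the bound pass through unscaled."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    scale = np.minimum(1.0, bound / rms)  # only shrink inputs that would saturate
    return gamma * np.tanh(x * scale) + beta

# Even with large-magnitude hidden states, tanh inputs stay bounded,
# so gradients through tanh do not vanish from saturation.
x = np.random.randn(4, 16) * 50.0
y = bhyt_sketch(x)
```

Because the rescale caps the per-token RMS rather than normalizing unconditionally, small activations are left untouched, which is one plausible way to realize the "non-saturating range" property the abstract describes.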
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8591