Keywords: Large Language Models, Training Stability, Pre-Training
Abstract: Training stability is a critical challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we introduce Scale-Distribution Decoupling (SDD), a novel approach designed to enhance training stability by explicitly decoupling the scale and distribution of the weight matrix within fully-connected layers. SDD employs a normalization mechanism to regulate activation magnitudes and a learnable scaling vector to maintain well-conditioned gradients, thereby preventing gradient explosion and dissipation while ensuring stable gradient propagation. This principled separation improves optimization efficiency, especially in deep networks. Extensive experiments across various LLM architectures (dense and MoE) demonstrate that SDD consistently achieves faster convergence and superior performance compared to existing normalization techniques. Furthermore, SDD is lightweight and seamlessly compatible with current frameworks, offering a practical and effective solution for robust LLM training.
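The abstract describes the mechanism only at a high level. As a rough illustration of the stated idea (normalizing the output of a fully-connected layer and re-introducing magnitude through a learnable scaling vector), one possible sketch in PyTorch is given below. The class name, the RMS-style normalization, and all parameter choices are assumptions for illustration, not the authors' implementation, which is provided in the supplementary material.

```python
# Hedged sketch of the idea described in the abstract, not the paper's exact
# formulation: the fully-connected output is normalized (constraining its
# distribution/magnitude) and a learnable per-channel scale vector restores
# the scale as a separate, decoupled parameter. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleDistributionDecoupledLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, eps: float = 1e-6):
        super().__init__()
        # Weight matrix carries the "distribution" component.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)
        # Learnable scaling vector carries the "scale" component.
        self.scale = nn.Parameter(torch.ones(out_features))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.linear(x, self.weight)  # raw projection, shape (..., out_features)
        # Normalize activation magnitudes (RMS-style), then rescale with the
        # decoupled learnable vector.
        rms = h.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return h * rms * self.scale

# Example: a drop-in replacement for nn.Linear inside a Transformer MLP block.
layer = ScaleDistributionDecoupledLinear(1024, 4096)
y = layer(torch.randn(2, 16, 1024))  # (batch, seq, 1024) -> (batch, seq, 4096)
```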
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16491