Keywords: Large Language Models, Training Stability, Pre-Training
Abstract: Training stability is a critical challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we introduce Scale-Distribution Decoupling (SDD), a novel approach designed to enhance training stability by explicitly decoupling the scale and distribution of the weight matrix within fully-connected layers. SDD employs a normalization mechanism to regulate activation magnitudes and a learnable scaling vector to maintain well-conditioned gradients, thereby preventing gradient explosion and dissipation while ensuring stable gradient propagation. This principled separation improves optimization efficiency, especially in deep networks. Extensive experiments across various LLM architectures (dense and MoE) demonstrate that SDD consistently achieves faster convergence and superior performance compared to existing normalization techniques. Furthermore, SDD is lightweight and seamlessly compatible with current frameworks, offering a practical and effective solution for robust LLM training.
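The abstract describes the mechanism only at a high level. As a rough illustration of the stated idea (normalizing the output of a fully-connected layer and re-introducing magnitude through a learnable scaling vector), one possible sketch in PyTorch is given below. The class name, the RMS-style normalization, and all parameter choices are assumptions for illustration, not the authors' implementation, which is provided in the supplementary material.

```python
# Hedged sketch of the idea described in the abstract, not the paper's exact
# formulation: the fully-connected output is normalized (constraining its
# distribution/magnitude) and a learnable per-channel scale vector restores
# the scale as a separate, decoupled parameter. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleDistributionDecoupledLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, eps: float = 1e-6):
        super().__init__()
        # Weight matrix carries the "distribution" component.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)
        # Learnable scaling vector carries the "scale" component.
        self.scale = nn.Parameter(torch.ones(out_features))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.linear(x, self.weight)  # raw projection, shape (..., out_features)
        # Normalize activation magnitudes (RMS-style), then rescale with the
        # decoupled learnable vector.
        rms = h.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return h * rms * self.scale

# Example: a drop-in replacement for nn.Linear inside a Transformer MLP block.
layer = ScaleDistributionDecoupledLinear(1024, 4096)
y = layer(torch.randn(2, 16, 1024))  # (batch, seq, 1024) -> (batch, seq, 4096)
```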
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16491