Abstract: Large Language Models (LLMs) have experienced remarkable performance gains through increased parameter counts and training data, but this growth poses significant challenges for on-device deployment. Quantization has emerged as a critical technique to reduce compute and memory overhead in resource-constrained environments. Unfortunately, traditional quantization approaches are hampered by outliers--rare but extreme activation values that stretch quantization ranges and degrade performance. Recent work suggests that the Adam optimizer itself may contribute to outlier formation through its element-wise gradient normalization.
In this paper, we adopt Muon as a practical alternative to Adam for large-scale LLM training. By performing efficient gradient orthogonalization via Newton-Schulz iterations, Muon avoids the heavy overhead common to second-order methods such as Shampoo. We further propose an Outlier-Safe Pre-Training (OSP) framework that incorporates learnable embedding rotations and single-scale RMSNorm, suppressing outliers without requiring architectural modifications at inference time. Our ablation study on a 100-billion-token corpus demonstrates that these components effectively mitigate outliers while maintaining model quality. We validate our approach by training a 1.4B-parameter LLM on 1 trillion tokens, to our knowledge the first production-scale model trained without Adam. The resulting model exhibits distinct quantization behavior under 4-bit weight-and-activation (W4A4) quantization compared to existing open-source LLMs, suggesting new possibilities for robust low-bit pre-training in LLM development.
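The abstract refers to gradient orthogonalization via Newton-Schulz iterations. As a rough illustration only, the sketch below shows how such an orthogonalization step could be written; it uses the classical cubic Newton-Schulz iteration for the matrix polar factor, whereas the actual Muon implementation uses tuned higher-order polynomial coefficients and momentum. The function name, `num_steps` parameter, and usage snippet are illustrative assumptions, not the paper's code.

```python
import torch


def newton_schulz_orthogonalize(grad: torch.Tensor, num_steps: int = 5) -> torch.Tensor:
    """Approximately replace a 2-D gradient with its nearest orthogonal factor.

    Illustrative sketch: the cubic Newton-Schulz iteration
    X_{k+1} = 1.5 * X_k - 0.5 * X_k X_k^T X_k converges to the polar
    (orthogonal) factor of X_0 when the spectral norm of X_0 is at most 1.
    """
    assert grad.ndim == 2, "orthogonalization applies to matrix-shaped parameters"
    X = grad / (grad.norm() + 1e-7)  # Frobenius normalization bounds the spectral norm by 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:  # work with the short side first for cheaper matmuls
        X = X.T
    for _ in range(num_steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X.T if transpose else X


# Hypothetical usage inside a momentum-SGD style update loop:
# for p, buf in zip(matrix_params, momentum_buffers):
#     buf.mul_(momentum).add_(p.grad)               # momentum accumulation
#     p.data -= lr * newton_schulz_orthogonalize(buf)
```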
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: quantization; NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7852