Abstract: Large Language Models (LLMs) have experienced remarkable performance gains through increased parameter counts and training data, but this growth poses significant challenges for on-device deployment. Quantization has emerged as a critical technique to reduce compute and memory overhead in resource-constrained environments. Unfortunately, traditional quantization approaches are hampered by outliers--rare but extreme activation values that stretch quantization ranges and degrade performance. Recent work suggests that the Adam optimizer itself may contribute to outlier formation through its element-wise gradient normalization.
In this paper, we adopt Muon as a practical alternative to Adam for large-scale LLM training. By performing efficient gradient orthogonalization via Newton-Schulz iterations, Muon avoids the heavy overhead common to second-order methods such as Shampoo. We further propose an Outlier-Safe Pre-Training (OSP) framework that incorporates learnable embedding rotations and single-scale RMSNorm, suppressing outliers without requiring architectural modifications at inference time. Our ablation study on a 100-billion-token corpus demonstrates that these components effectively mitigate outliers while maintaining model quality. We validate our approach by training a 1.4B-parameter LLM on 1 trillion tokens, to our knowledge the first production-scale model trained without Adam. The resulting model exhibits distinct quantization behavior under 4-bit weight-and-activation (W4A4) quantization compared to existing open-source LLMs, suggesting new possibilities for robust low-bit pre-training in LLM development.
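The abstract refers to gradient orthogonalization via Newton-Schulz iterations. As a rough illustration only, the sketch below shows how such an orthogonalization step could be written; it uses the classical cubic Newton-Schulz iteration for the matrix polar factor, whereas the actual Muon implementation uses tuned higher-order polynomial coefficients and momentum. The function name, `num_steps` parameter, and usage snippet are illustrative assumptions, not the paper's code.

```python
import torch


def newton_schulz_orthogonalize(grad: torch.Tensor, num_steps: int = 5) -> torch.Tensor:
    """Approximately replace a 2-D gradient with its nearest orthogonal factor.

    Illustrative sketch: the cubic Newton-Schulz iteration
    X_{k+1} = 1.5 * X_k - 0.5 * X_k X_k^T X_k converges to the polar
    (orthogonal) factor of X_0 when the spectral norm of X_0 is at most 1.
    """
    assert grad.ndim == 2, "orthogonalization applies to matrix-shaped parameters"
    X = grad / (grad.norm() + 1e-7)  # Frobenius normalization bounds the spectral norm by 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:  # work with the short side first for cheaper matmuls
        X = X.T
    for _ in range(num_steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X.T if transpose else X


# Hypothetical usage inside a momentum-SGD style update loop:
# for p, buf in zip(matrix_params, momentum_buffers):
#     buf.mul_(momentum).add_(p.grad)               # momentum accumulation
#     p.data -= lr * newton_schulz_orthogonalize(buf)
```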
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: quantization; NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7852