Stable-SPAM: How to Stably Train Large Language Models in 4-Bit

ICLR 2026 Conference Submission 19222 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: 4-bit training, training stability, loss spike, LLMs
TL;DR: We propose Stable-SPAM, a spike-aware optimizer for stably training LLMs in 4-bit.
Abstract: This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but it struggles to stabilize gradient norms and requires careful learning rate tuning. To address these limitations, we propose **Stable-SPAM**, which incorporates enhanced gradient normalization and clipping techniques. In particular, **Stable-SPAM** (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical $l_2$-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that **Stable-SPAM** effectively stabilizes gradient norms in 4-bit LLM training, consistently delivering superior performance compared to Adam and SPAM across model sizes from LLaMA-130M to LLaMA-7B. Notably, our 4-bit LLaMA-1B model trained with **Stable-SPAM** outperforms Adam by up to $3.1$ perplexity points. Furthermore, in 4-bit training, **Stable-SPAM** achieves the same loss as Adam while requiring only about half the training steps. Code is submitted.
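
For concreteness, the three ingredients listed in the abstract can be sketched as a small Adam-style optimizer in PyTorch. This is a minimal illustration based only on the abstract's description, not the submitted code: the class name `StableSPAMSketch`, the hyperparameters `gamma_clip`, `gamma_norm`, and `reset_interval`, and the exact EMA-based update rules are assumptions.

```python
import torch


class StableSPAMSketch(torch.optim.Optimizer):
    """Adam with (1) adaptive spike clipping, (2) historical-norm scaling, and
    (3) periodic momentum reset, sketched from the abstract's description."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 gamma_clip=0.999, gamma_norm=0.7, reset_interval=500):
        defaults = dict(lr=lr, betas=betas, eps=eps)
        super().__init__(params, defaults)
        self.gamma_clip = gamma_clip          # EMA factor for the historical max gradient (assumed)
        self.gamma_norm = gamma_norm          # EMA factor for the historical l2-norm (assumed)
        self.reset_interval = reset_interval  # momentum-reset period (assumed)
        self.global_step = 0

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        self.global_step += 1
        do_reset = (self.global_step % self.reset_interval == 0)

        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                    state["max_grad"] = g.abs().max().clone()  # historical max, seeded on first step
                    state["norm_ema"] = g.norm().clone()       # historical l2-norm, seeded on first step
                    state["t"] = 0

                # (1) Adaptive spike clipping: update the historical max of |g| and
                #     clip any entry whose magnitude exceeds it.
                state["max_grad"].mul_(self.gamma_clip).add_(g.abs().max(), alpha=1 - self.gamma_clip)
                thresh = state["max_grad"]
                g = torch.where(g.abs() > thresh, torch.sign(g) * thresh, g)

                # (2) Norm scaling: rescale the whole gradient matrix so its l2-norm
                #     matches an exponential moving average of past norms.
                g_norm = g.norm()
                state["norm_ema"].mul_(self.gamma_norm).add_(g_norm, alpha=1 - self.gamma_norm)
                g = g * (state["norm_ema"] / (g_norm + group["eps"]))

                # (3) Momentum reset (inherited from SPAM): periodically zero Adam's
                #     first and second moments so spiked gradients do not accumulate.
                if do_reset:
                    state["m"].zero_()
                    state["v"].zero_()
                    state["t"] = 0

                # Standard Adam update with bias correction.
                state["t"] += 1
                t = state["t"]
                state["m"].mul_(beta1).add_(g, alpha=1 - beta1)
                state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)
                m_hat = state["m"] / (1 - beta1 ** t)
                v_hat = state["v"] / (1 - beta2 ** t)
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])

        return loss
```

Usage would follow the standard PyTorch loop, e.g. `opt = StableSPAMSketch(model.parameters(), lr=1e-3)` followed by `loss.backward(); opt.step(); opt.zero_grad()`; low-bit quantization of weights and activations is orthogonal to this sketch and not shown.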
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19222