Keywords: Large language models, memory optimization, optimizer states
Abstract: Large Language Models (LLMs) achieve remarkable performance but at the cost of substantial memory overhead, particularly when trained with memory-intensive adaptive optimizers such as Adam. As model sizes continue to grow, memory efficiency has become a critical bottleneck. Existing approaches often rely on costly techniques such as singular value decomposition or matrix-level operations to reduce memory usage, which can slow training or degrade performance. In this paper, we propose SGD with Row-wise Normalization (SRON), a state-free optimizer motivated by observed row-level gradient disparities in the Attention module. We provide a theoretical analysis establishing SRON’s convergence for non-convex objectives under $L$-Lipschitz smoothness, ensuring its soundness for large-scale models. Extensive experiments across architectures (LLaMA, GPT, Gemma) and model sizes (60M–7B parameters) show that SRON reduces optimizer state memory overhead by 90\%–100\% and cuts training time by up to 67\% on billion-parameter models. Moreover, SRON consistently matches or outperforms Adam and other baselines on both pre-training and fine-tuning tasks, demonstrating its effectiveness as a memory-efficient and high-performance optimizer for LLM training.
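As a rough illustration of the row-wise normalization idea described in the abstract, the following minimal PyTorch sketch applies a state-free SGD update in which each row of a 2-D gradient is divided by its L2 norm. The exact normalization rule, any scaling factors, and the treatment of 1-D parameters in SRON are assumptions made for this sketch, not details taken from the paper.

```python
# Minimal sketch of a state-free, row-wise normalized SGD step.
# Assumption: each row of a 2-D gradient is rescaled by its L2 norm;
# 1-D parameters (biases, norms) fall back to plain SGD.
import torch

def row_normalized_sgd_step(params, lr=1e-3, eps=1e-8):
    """Apply one in-place update; keeps no optimizer state."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad
            if g.ndim >= 2:
                # Normalize each row so rows with disparate gradient
                # magnitudes take comparably sized steps.
                row_norm = g.norm(dim=-1, keepdim=True).clamp_min(eps)
                g = g / row_norm
            p.add_(g, alpha=-lr)

# Usage (hypothetical): after loss.backward(), call
# row_normalized_sgd_step(model.parameters(), lr=1e-3).
```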
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5775