Keywords: Large language models, memory optimization, optimizer states
Abstract: Large Language Models (LLMs) achieve remarkable performance but at the cost of substantial memory overhead, particularly when trained with memory-intensive adaptive optimizers such as Adam. As model sizes continue to grow, memory efficiency has become a critical bottleneck. Existing approaches often rely on costly techniques such as singular value decomposition or matrix-level operations to reduce memory usage, which can slow training or degrade performance. In this paper, we propose SGD with Row-wise Normalization (SRON), a state-free optimizer motivated by observed row-level gradient disparities in the Attention module. We provide a theoretical analysis establishing SRON’s convergence for non-convex objectives under $L$-Lipschitz smoothness, ensuring its soundness for large-scale models. Extensive experiments across architectures (LLaMA, GPT, Gemma) and model sizes (60M–7B parameters) show that SRON reduces optimizer state memory overhead by 90\%–100\% and cuts training time by up to 67\% on billion-parameter models. Moreover, SRON consistently matches or outperforms Adam and other baselines on both pre-training and fine-tuning tasks, demonstrating its effectiveness as a memory-efficient and high-performance optimizer for LLM training.
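As a rough illustration of the row-wise normalization idea described in the abstract, the following minimal PyTorch sketch applies a state-free SGD update in which each row of a 2-D gradient is divided by its L2 norm. The exact normalization rule, any scaling factors, and the treatment of 1-D parameters in SRON are assumptions made for this sketch, not details taken from the paper.

```python
# Minimal sketch of a state-free, row-wise normalized SGD step.
# Assumption: each row of a 2-D gradient is rescaled by its L2 norm;
# 1-D parameters (biases, norms) fall back to plain SGD.
import torch

def row_normalized_sgd_step(params, lr=1e-3, eps=1e-8):
    """Apply one in-place update; keeps no optimizer state."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad
            if g.ndim >= 2:
                # Normalize each row so rows with disparate gradient
                # magnitudes take comparably sized steps.
                row_norm = g.norm(dim=-1, keepdim=True).clamp_min(eps)
                g = g / row_norm
            p.add_(g, alpha=-lr)

# Usage (hypothetical): after loss.backward(), call
# row_normalized_sgd_step(model.parameters(), lr=1e-3).
```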
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5775