SWAN: SGD WITH NORMALIZATION AND WHITENING ENABLES STATELESS LLM TRAINING

Published: 17 Dec 2024 · Last Modified: 09 May 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they typically maintain optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer: it does not track state variables during training, and it therefore achieves optimal memory efficiency. However, its performance in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD gradients with normalization and whitening in a stateless manner can match the performance of the Adam optimizer for LLM training, while retaining the memory footprint of SGD. Specifically, we show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN achieves an $\approx$ 50% reduction in total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training LLaMA models with 350M and 1.3B parameters, SWAN achieves a 2× speedup by reaching the same evaluation perplexity using half as many tokens.
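The abstract does not spell out the exact normalization and whitening operators, so the sketch below is only one plausible stateless reading, written in PyTorch: `normalize` standardizes each row of a 2D gradient, and `whiten` applies an approximate $(GG^\top)^{-1/2}$ computed with a few Newton-Schulz iterations, so no optimizer state is carried between steps. The function names, hyperparameters, and the choice of Newton-Schulz are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a stateless SWAN-style update (assumptions noted above).
import torch


def normalize(grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize each row of a 2D gradient to zero mean and unit variance."""
    mean = grad.mean(dim=1, keepdim=True)
    std = grad.std(dim=1, keepdim=True)
    return (grad - mean) / (std + eps)


def whiten(grad: torch.Tensor, num_iters: int = 5, eps: float = 1e-8) -> torch.Tensor:
    """Approximate (G G^T)^{-1/2} G with Newton-Schulz iterations (no stored state)."""
    m = grad.shape[0]
    eye = torch.eye(m, device=grad.device, dtype=grad.dtype)
    A = grad @ grad.T + eps * eye
    A = A / A.norm()  # Frobenius scaling so the iteration converges
    X = eye.clone()
    for _ in range(num_iters):
        # Newton-Schulz step: X <- 0.5 * X * (3I - X^2 A), converging to A^{-1/2}
        X = 1.5 * X - 0.5 * (X @ X @ X) @ A
    # Result equals (G G^T)^{-1/2} G up to a positive scalar from the scaling above,
    # which in practice can be absorbed into the learning rate.
    return X @ grad


def swan_like_step(param: torch.Tensor, lr: float = 1e-3) -> None:
    """One SGD step on a 2D parameter whose gradient is normalized, then whitened."""
    g = whiten(normalize(param.grad))
    param.data.add_(g, alpha=-lr)
```

Because both operators are recomputed from the current gradient alone, the update keeps SGD's memory footprint: nothing analogous to Adam's first- and second-moment buffers is stored between iterations.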