TL;DR: Training LLMs without storing optimizer states
Abstract: Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD gradients with normalization and whitening in a stateless manner can match Adam's performance for LLM training, while retaining the memory footprint of SGD. Specifically, we show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN achieves a ~50% reduction in total end-to-end memory compared to Adam. Under the memory-efficient LLaMA training benchmark of (Zhao et al., 2024a), SWAN reaches the same evaluation perplexity using half as many tokens for 350M and 1.3B models.
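To make the idea concrete, here is a minimal sketch of a stateless "normalize, then whiten" gradient pre-processing step applied before a plain SGD update. It is illustrative only: the per-row standardization, the Newton-Schulz iteration used to approximate the inverse matrix square root, and the `eps` / `ns_steps` parameters are assumptions for this sketch, not the paper's exact recipe.

```python
import torch


def swan_like_update(grad: torch.Tensor, lr: float,
                     eps: float = 1e-8, ns_steps: int = 5) -> torch.Tensor:
    """Illustrative stateless pre-processing of a 2D gradient matrix.

    Sketch only: the exact normalization and whitening used by SWAN may
    differ; no optimizer state is stored between calls.
    """
    # 1) Normalization (assumed per-row standardization): stabilizes the
    #    scale of the gradient without tracking running statistics.
    g = (grad - grad.mean(dim=1, keepdim=True)) / (grad.std(dim=1, keepdim=True) + eps)

    # 2) Whitening: approximate (G G^T)^{-1/2} G to decorrelate the rows,
    #    acting as a stateless counter to local curvature.
    n = g.shape[0]
    eye = torch.eye(n, device=g.device, dtype=g.dtype)
    gtg = g @ g.T + eps * eye          # regularize for numerical stability
    c = gtg.norm() + eps               # scale so the iteration converges
    Y, Z = gtg / c, eye.clone()
    for _ in range(ns_steps):          # coupled Newton-Schulz iteration:
        T = 0.5 * (3.0 * eye - Z @ Y)  #   Y -> (gtg/c)^{1/2}, Z -> (gtg/c)^{-1/2}
        Y = Y @ T
        Z = T @ Z
    whitened = (Z / c.sqrt()) @ g      # undo the scaling: (G G^T)^{-1/2} G

    # Plain SGD step on the pre-processed gradient; nothing is carried over.
    return -lr * whitened
```

In use, one would simply add the returned update to each 2D weight matrix at every step; because the two transforms depend only on the current gradient, there are no first- or second-moment buffers to store, which is the source of the memory savings over Adam.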
Lay Summary: Training giant AI language models is very memory-intensive. One reason is that popular training methods (like one called Adam) store a lot of information about past updates to help the model learn, which uses up huge amounts of memory. Our research introduces a new approach called SWAN that eliminates this issue. SWAN keeps no record of past updates. Instead, at each learning step it makes two quick adjustments (one to keep the step size stable, and another to steer the step in the right direction) so the model learns smoothly without extra memory. In tests, SWAN trained large language models just as accurately as Adam while needing only about half the training data and far less memory. This is early proof that we can train very powerful AI systems efficiently without heavy memory demands, potentially making advanced AI development faster, cheaper, and more accessible.
Primary Area: Deep Learning->Algorithms
Keywords: Efficient LLM optimization, Stateless Optimizers
Submission Number: 12101