Keywords: Adam, SGD, Language Models, Attention, Transformers
TL;DR: We investigate the optimizer gap between SGD and Adam in transformer language models, showing that SGD with momentum can match Adam in small-batch settings and identifying batch size as a crucial factor in explaining this gap.
Abstract: Adam is known to perform significantly better than Stochastic Gradient Descent (SGD) when training language models, a phenomenon for which a number of explanations have been proposed. In this work, we revisit this "optimizer gap" through a series of comprehensively tuned baseline training runs for language modeling with transformers. We exhaustively study how momentum, gradient clipping, and batch size affect the gap between SGD and Adam. Our empirical findings show that, if tuned correctly, SGD with momentum can actually perform similarly to Adam in small-batch settings. We revisit existing explanations for Adam's advantage, including heavy-tailed class imbalance, directional sharpness, and Hessian heterogeneity, and find that they struggle to explain these findings. Finally, by analyzing our transformer training runs and a simple quadratic setting, we provide new insights into what makes SGD perform poorly, showing that batch size must be a necessary component of any explanation of the optimizer gap.
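To make the comparison concrete, here is a minimal sketch of the kind of side-by-side setup the abstract describes, written in PyTorch with a toy linear model and hypothetical hyperparameters standing in for the paper's tuned transformer runs; the optimizer settings, batch size, and clipping threshold below are illustrative assumptions, not the authors' values.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer language model; the paper's actual
# architecture and tuned hyperparameters are not reproduced here.
torch.manual_seed(0)
model_sgd = nn.Linear(32, 32)
model_adam = nn.Linear(32, 32)
model_adam.load_state_dict(model_sgd.state_dict())  # identical initialization

# Hypothetical hyperparameters for illustration only.
opt_sgd = torch.optim.SGD(model_sgd.parameters(), lr=0.1, momentum=0.9)
opt_adam = torch.optim.Adam(model_adam.parameters(), lr=1e-3, betas=(0.9, 0.999))

criterion = nn.MSELoss()
batch_size = 8  # small-batch regime, where the abstract reports the gap shrinking

for step in range(100):
    x = torch.randn(batch_size, 32)
    y = torch.randn(batch_size, 32)
    for model, opt in ((model_sgd, opt_sgd), (model_adam, opt_adam)):
        opt.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        # Gradient clipping: one of the knobs studied alongside momentum
        # and batch size.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
```

The three quantities varied in the study (momentum, the clipping threshold, and `batch_size`) correspond directly to the arguments and variables above.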
Student Paper: Yes
Submission Number: 81