Eliminating the first-moment state in the Adam optimizer

ICLR 2026 Conference Submission 25381 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: half-memory Adam, efficient Adam, memory-efficient optimizer
TL;DR: We present a novel variant of the Adam optimizer that uses a single state variable instead of two.
Abstract: The Adam optimizer and its variants are widely used in large-scale machine learning, but their memory footprint is high because they maintain two state variables per parameter. In Adam, the exponential moving average (EMA) of gradients (m) serves as a first-moment estimator, but it also carries variance information that can be exploited to estimate the second moment. Furthermore, the gradient buffer can be repurposed to handle both gradient accumulation and a proxy for the first moment, effectively folding m into the gradient buffer itself. These modifications reduce the number of optimizer state variables from two to one, yielding Half-Memory Adam (HMAdam) and its decoupled-weight-decay variant (HMAdamW). Both variants retain the Adam update rule and hyperparameters. Experiments across discriminative and generative tasks, including CNNs, transformers, and diffusion models, show that HMAdamW matches standard AdamW in convergence speed, final accuracy, and runtime while substantially lowering memory usage. Moreover, HMAdam retains Adam's convergence properties. This makes it a practical choice for memory-constrained training scenarios such as large-scale language modeling.
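The abstract does not spell out the exact update rule, so the following is only a minimal PyTorch-style sketch of one way a single-state, AdamW-flavoured step could look under the stated idea: the first moment is kept inside p.grad by decaying the gradient buffer instead of zeroing it, and the lone optimizer tensor v tracks a second-moment proxy built from that buffer. The function name hm_adamw_step, the choice of second-moment estimator, and all other details are illustrative assumptions, not the authors' algorithm.

```python
# Illustrative sketch only: a one-state AdamW-style update where the first
# moment lives in p.grad (no zero_grad between steps) and the optimizer keeps
# a single tensor v per parameter. Not the authors' HMAdamW implementation.
import torch

@torch.no_grad()
def hm_adamw_step(params, state, *, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, weight_decay=1e-2, step=1):
    for p in params:
        if p.grad is None:
            continue
        g = p.grad                                  # decayed, accumulated gradient buffer
        v = state.setdefault(p, torch.zeros_like(p))  # the single optimizer state tensor

        # If g_t = beta1 * g_{t-1} + grad_t (backward() accumulates onto the decayed
        # buffer), then (1 - beta1) * g_t equals the usual EMA m_t of the gradients.
        m = g * (1 - beta1)

        # Second-moment proxy derived from the first-moment EMA (one possible choice,
        # loosely following the abstract's claim that m carries variance information).
        v.mul_(beta2).addcmul_(m, m, value=1 - beta2)

        m_hat = m / (1 - beta1 ** step)             # bias correction, as in Adam
        v_hat = v / (1 - beta2 ** step)

        p.mul_(1 - lr * weight_decay)               # decoupled weight decay (AdamW style)
        p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)

        # Decay instead of zeroing: the next backward() accumulates the new gradient
        # on top, which is what folds the first moment into the gradient buffer.
        g.mul_(beta1)
```

In such a scheme the training loop would not call zero_grad(); hm_adamw_step leaves the beta1-decayed buffer in p.grad so the next backward() pass continues the running first-moment estimate.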
Primary Area: optimization
Submission Number: 25381