AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

ACL ARR 2025 February Submission 7076 Authors

16 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract:

We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, namely the root mean square of a properly weighted momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it directly inherits AdamW's hyperparameters and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_0, L_1)$ smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes. In this setting, momentum offers a naturally smoothed gradient estimate. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance across diverse tasks and architectures, including pretraining runs on GPT-2 and Llama2 (up to 13B parameters). It also excels in reinforcement learning post-training, particularly in the DeepSeek R1-Zero replication task, underscoring its versatility across training paradigms. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
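
To make the abstract's description concrete, below is a minimal, hypothetical PyTorch sketch of an AdamS-style step based only on the abstract: the second-moment buffer is replaced by the root mean square of a weighted combination of the momentum and the current gradient, while only a single momentum buffer (as in SGD with momentum) is kept. The exact weighting (beta2 on the squared momentum, 1 - beta2 on the squared gradient), the function name `adams_step`, and the default hyperparameters are assumptions for illustration, not the authors' published formula.

```python
import torch

def adams_step(param, grad, momentum, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One in-place AdamS-style update for a single parameter tensor (sketch)."""
    # First moment: the same single buffer SGD with momentum would keep.
    momentum.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Denominator: RMS of the weighted momentum and the current gradient,
    # standing in for Adam's running second-moment estimate (assumed weighting).
    denom = (beta2 * momentum.square()
             + (1 - beta2) * grad.square()).sqrt_().add_(eps)
    # Decoupled weight decay, as in AdamW, whose hyperparameters are reused.
    param.mul_(1 - lr * weight_decay)
    # Parameter update: momentum divided elementwise by the RMS denominator.
    param.addcdiv_(momentum, denom, value=-lr)
    return param, momentum

# Usage sketch: one step on a toy parameter.
p = torch.randn(4, requires_grad=False)
g = torch.randn(4)
m = torch.zeros_like(p)
adams_step(p, g, m)
```

Because the only persistent state is the momentum buffer, the memory footprint matches SGD with momentum, which is the efficiency claim the abstract makes.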

Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Adam, memory-efficient optimizer, LLM
Contribution Types: Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7076