Keywords: Large Language Models, Optimization, Memory Efficiency, Pretraining, Embedding Layer, Sign-based Optimizer
Abstract: The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck: its optimizer states consume roughly twice the model's size. Light-state optimizers such as SinkGD attempt to solve this, but we identify an embedding layer paradox: these methods fail to handle the sparse, high-variance gradients of the embedding layer, forcing a hybrid design that reverts to AdamW for that layer and partially nullifies the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this paradox by replacing AdamW within the hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded above by $1.0$, which tames high-variance dimensions more effectively than existing methods. This added stability allows SAGE to converge faster and more reliably. On Llama models up to 1.3B parameters, our SAGE-based hybrid achieves new state-of-the-art perplexity, outperforming all baselines, including the SinkGD hybrid, while significantly reducing optimizer state memory.
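The abstract describes SAGE only at a high level (a Lion-style sign direction combined with an $O(d)$ adaptive scale bounded by $1.0$); the exact update rule is not given here. The sketch below is a hypothetical NumPy illustration of what such a step could look like: the Lion update is real and follows the published Lion algorithm, but the damping-scale formula (`scale`) and hyperparameter names (`gamma`, `eps`) are stand-in assumptions, not the paper's method.

```python
import numpy as np

def sage_like_step(param, grad, m, v, lr=1e-4, beta1=0.9, beta2=0.99,
                   gamma=0.99, wd=0.01, eps=1e-8):
    """One hypothetical SAGE-style step (illustrative only).

    Direction: Lion-style sign of an interpolated momentum (one O(d) buffer m).
    Scale: an assumed per-dimension "safe damper" clipped to at most 1.0,
    built from an O(d) second-moment statistic v. The real SAGE scale is
    not specified in the abstract; this stand-in only shows the shape of
    the idea (sign direction * bounded per-dimension scale).
    """
    # Lion-style interpolated direction, then elementwise sign
    direction = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # O(d) running statistic used by the (assumed) damper
    v = gamma * v + (1.0 - gamma) * grad * grad
    # "Safe damper": bounded above by 1.0 by construction via the min
    scale = np.minimum(1.0, np.abs(grad) / (np.sqrt(v) + eps))
    # Decoupled weight decay, as in AdamW / Lion
    param = param - lr * (scale * direction + wd * param)
    # Lion-style momentum update after the step (uses beta2, as in Lion)
    m = beta2 * m + (1.0 - beta2) * grad
    return param, m, v
```

Note the memory footprint: two $O(d)$ buffers here, versus AdamW's two as well; a faithful SAGE implementation would presumably need fewer or cheaper statistics to realize the claimed savings, which this sketch does not attempt to reproduce.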
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Language Modeling, Machine Learning for NLP, Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 2364