Keywords: Large Language Models, Optimization, Memory Efficiency, Pretraining, Embedding Layer, Sign-based Optimizer
Abstract: The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck: its optimizer states consume roughly twice the model's size. Light-state optimizers such as SinkGD attempt to solve this, but we identify an embedding layer paradox: these methods fail to handle the sparse, high-variance gradients of the embedding layer, forcing a hybrid design that reverts to AdamW for that layer and partially nullifies the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this paradox by replacing AdamW within the hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded above by $1.0$, which tames high-variance dimensions more effectively than existing methods. This added stability allows SAGE to converge faster and more reliably. On Llama models up to 1.3B parameters, our SAGE-based hybrid achieves new state-of-the-art perplexity, outperforming all baselines, including the SinkGD hybrid, while significantly reducing optimizer state memory.
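The abstract describes SAGE only at a high level (a Lion-style sign direction combined with an $O(d)$ adaptive scale bounded by $1.0$); the exact update rule is not given here. The sketch below is a hypothetical NumPy illustration of what such a step could look like: the Lion update is real and follows the published Lion algorithm, but the damping-scale formula (`scale`) and hyperparameter names (`gamma`, `eps`) are stand-in assumptions, not the paper's method.

```python
import numpy as np

def sage_like_step(param, grad, m, v, lr=1e-4, beta1=0.9, beta2=0.99,
                   gamma=0.99, wd=0.01, eps=1e-8):
    """One hypothetical SAGE-style step (illustrative only).

    Direction: Lion-style sign of an interpolated momentum (one O(d) buffer m).
    Scale: an assumed per-dimension "safe damper" clipped to at most 1.0,
    built from an O(d) second-moment statistic v. The real SAGE scale is
    not specified in the abstract; this stand-in only shows the shape of
    the idea (sign direction * bounded per-dimension scale).
    """
    # Lion-style interpolated direction, then elementwise sign
    direction = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # O(d) running statistic used by the (assumed) damper
    v = gamma * v + (1.0 - gamma) * grad * grad
    # "Safe damper": bounded above by 1.0 by construction via the min
    scale = np.minimum(1.0, np.abs(grad) / (np.sqrt(v) + eps))
    # Decoupled weight decay, as in AdamW / Lion
    param = param - lr * (scale * direction + wd * param)
    # Lion-style momentum update after the step (uses beta2, as in Lion)
    m = beta2 * m + (1.0 - beta2) * grad
    return param, m, v
```

Note the memory footprint: two $O(d)$ buffers here, versus AdamW's two as well; a faithful SAGE implementation would presumably need fewer or cheaper statistics to realize the claimed savings, which this sketch does not attempt to reproduce.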
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Language Modeling, Machine Learning for NLP, Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 2364