Keywords: Optimization, Adam, Momentum
Abstract: We introduce Adaptive Momentum Scaling (AMS), a general optimization framework that decouples the direction and magnitude of parameter updates by separately tracking the sign and scale of momentum. AMS unifies and extends existing optimizers; in particular, we show that it recovers Adam and Cautious Adam as special cases through appropriate hyperparameter choices. Building on this framework, we develop Gradient Descent with Adaptive Momentum Scaling (Grams), a novel optimizer that leverages the gradient direction for updates while using momentum exclusively for adaptive magnitude scaling. This design enables Grams to achieve more effective loss descent than conventional momentum-based and cautious methods. We provide theoretical guarantees for Grams, including a discrete-time descent analysis, and further connect its dynamics to Hamiltonian descent. Empirically, Grams consistently outperforms widely used optimizers such as Adam, Lion, and their cautious variants across a range of tasks, including pre-training and fine-tuning large language models. Our results demonstrate that AMS and Grams offer a principled and scalable solution for modern deep learning optimization.
Supplementary Material: zip
Primary Area: optimization
Submission Number: 14682
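
The abstract describes Grams as taking its update direction from the gradient while borrowing the update magnitude from momentum-based statistics. The following is a minimal sketch of that idea under stated assumptions: it pairs the sign of the current gradient with the magnitude of an Adam-style step. The function name, hyperparameter names, and defaults follow Adam conventions and are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def grams_like_step(param, grad, m, v, t, lr=1e-3,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical single update step illustrating the abstract's idea:
    direction from the gradient's sign, magnitude from an Adam-style
    momentum/second-moment estimate. Not the paper's exact algorithm."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    adam_step = m_hat / (np.sqrt(v_hat) + eps)    # Adam-style update
    # Decouple direction and magnitude: sign from the gradient,
    # scale from the momentum-based step.
    update = np.sign(grad) * np.abs(adam_step)
    return param - lr * update, m, v
```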