Simple Convergence Proof of Adam From a Sign-like Descent Perspective

Hanyang Peng; Shuang Qin; Hui Wang; Yue Yu; Zhouchen Lin

17 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Adam, Convergence Proof, Sign-Like Descent

TL;DR: A simple convergence proof of Adam from a sign-like descent perspective, with a sharper rate under weaker assumptions

Abstract: Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as preconditioned stochastic gradient descent with momentum (SGDM), formulated as ${x}_{t+1} = {x}_t - \frac{\gamma_t}{\sqrt{{v}_t}+\epsilon} \circ {m}_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. Although many prior works have treated Adam as a sign-like optimizer to explain its practical advantages, we are the first to formally prove the convergence of Adam from the perspective of sign-like descent, expressed as ${x}_{t+1} = {x}_t - \gamma_t \frac{|{m}_t|}{\sqrt{{v}_t}+\epsilon} \circ {\rm Sign}({m}_t)$. This reformulation significantly simplifies the convergence analysis. Under some mild conditions, we prove for the first time that Adam achieves the optimal rate of ${\cal O}\left(\frac{1}{T^{1/4}}\right)$, rather than the previous ${\cal O}\left(\frac{\ln T}{T^{1/4}}\right)$, under the weak assumptions of generalized $p$-affine variance and $(L_0, L_1, q)$-smoothness, with no dependence on the model dimensionality or the numerical stability parameter $\epsilon$. Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.
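The two update rules quoted in the abstract describe the same iterate, since $|{m}_t| \circ {\rm Sign}({m}_t) = {m}_t$ elementwise. The following is a minimal sketch of that identity in plain Python, not the authors' code. The toy objective $f(x) = \frac{1}{2}\|x\|^2$, the starting point, the hyperparameter values, the omission of bias correction, and all function names are assumptions made for illustration.

```python
import math


def sign(z):
    """Return -1, 0, or 1 according to the sign of z."""
    return (z > 0) - (z < 0)


def update_moments(m, v, g, beta1, beta2):
    """Exponential moving averages of the gradient and its square.

    Bias correction is omitted (an assumption), matching the update
    formulas as they are written in the abstract.
    """
    m_new = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v_new = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    return m_new, v_new


def step_preconditioned(x, m, v, lr, eps):
    """x - lr / (sqrt(v) + eps) * m, elementwise (preconditioned SGDM view)."""
    return [xi - lr / (math.sqrt(vi) + eps) * mi
            for xi, mi, vi in zip(x, m, v)]


def step_sign_like(x, m, v, lr, eps):
    """x - lr * |m| / (sqrt(v) + eps) * Sign(m), elementwise (sign-like view)."""
    return [xi - lr * (abs(mi) / (math.sqrt(vi) + eps)) * sign(mi)
            for xi, mi, vi in zip(x, m, v)]


def compare_forms(steps=50, lr=0.01, eps=1e-8, beta1=0.9, beta2=0.999):
    """Run Adam on the toy objective f(x) = 0.5 * ||x||^2 (gradient = x).

    At every step both candidate iterates are computed from the same
    (x, m, v); the largest elementwise gap between them is tracked.
    Returns (max_gap, final_x).
    """
    x = [1.0, -2.0, 0.5]          # hypothetical starting point
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    max_gap = 0.0
    for _ in range(steps):
        g = list(x)               # gradient of the toy quadratic
        m, v = update_moments(m, v, g, beta1, beta2)
        x_pre = step_preconditioned(x, m, v, lr, eps)
        x_sgn = step_sign_like(x, m, v, lr, eps)
        gap = max(abs(a - b) for a, b in zip(x_pre, x_sgn))
        max_gap = max(max_gap, gap)
        x = x_pre                 # continue from the preconditioned iterate
    return max_gap, x


if __name__ == "__main__":
    gap, x_final = compare_forms()
    print("largest gap between the two forms:", gap)
    print("final iterate:", x_final)
```

Any gap reported by `compare_forms` comes only from floating-point rounding, because the two functions evaluate the same expression in a different order. The sign-like form separates each coordinate's step into a direction, ${\rm Sign}({m}_t)$, and a nonnegative scale, $\gamma_t |{m}_t| / (\sqrt{{v}_t}+\epsilon)$, which is the decomposition the abstract refers to.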

Primary Area: optimization

Submission Number: 8153
