Keywords: Adam, Convergence Proof, Sign-Like Descent
TL;DR: A simple yet promising convergence proof of Adam from a sign-like descent perspective
Abstract: Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as preconditioned stochastic gradient descent with momentum (SGDM), formulated as ${x}_{t+1} = {x}_t - \frac{\gamma_t}{\sqrt{{v}_t}+\epsilon} \circ {m}_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend.
While many prior works have treated Adam as a sign-like optimizer to interpret its practical advantages, we are the first to formally provide a convergence proof for Adam from the perspective of sign-like descent, expressed as ${x}_{t+1} = {x}_t - \gamma_t \frac{|{m}_t|}{\sqrt{{v}_t}+\epsilon} \circ {\rm Sign}({m}_t)$. This reformulation significantly simplifies the convergence analysis. For the first time, under some mild conditions, we prove that Adam achieves the optimal rate of ${\cal O}\left(\frac{1}{T^{1/4}}\right)$ rather than the previous ${\cal O}\left(\frac{\ln T}{T^{1/4}}\right)$ under the weak assumptions of generalized $p$-affine variance and $(L_0, L_1, q)$-smoothness, without dependence on the model dimensionality or the numerical stability parameter $\epsilon$.
Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.
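The two update rules quoted in the abstract describe the same step, since ${m}_t = |{m}_t| \circ {\rm Sign}({m}_t)$ coordinate-wise. The following is a minimal pure-Python sketch of that identity, not the paper's implementation. It assumes the standard exponential-moving-average moment estimates without bias correction (the abstract does not specify them), and the function names and hyperparameter defaults (`beta1`, `beta2`, `lr`, `eps`) are illustrative choices.

```python
import math
import random


def sign(z):
    # Sign(z) in {-1, 0, 1}; Sign(0) = 0 so a zero momentum gives a zero step.
    return (z > 0) - (z < 0)


def adam_moments(m, v, g, beta1=0.9, beta2=0.999):
    # Assumed standard Adam moment updates (no bias correction).
    m_new = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v_new = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    return m_new, v_new


def step_preconditioned(x, m, v, lr=1e-3, eps=1e-8):
    # x_{t+1} = x_t - lr / (sqrt(v_t) + eps) * m_t
    return [xi - lr / (math.sqrt(vi) + eps) * mi
            for xi, mi, vi in zip(x, m, v)]


def step_sign_like(x, m, v, lr=1e-3, eps=1e-8):
    # x_{t+1} = x_t - lr * |m_t| / (sqrt(v_t) + eps) * Sign(m_t)
    return [xi - lr * (abs(mi) / (math.sqrt(vi) + eps)) * sign(mi)
            for xi, mi, vi in zip(x, m, v)]


# Accumulate moments over a few synthetic gradients, then compare one step.
random.seed(0)
dim = 5
x = [random.gauss(0.0, 1.0) for _ in range(dim)]
m = [0.0] * dim
v = [0.0] * dim
for _ in range(10):
    g = [random.gauss(0.0, 1.0) for _ in range(dim)]
    m, v = adam_moments(m, v, g)

x_a = step_preconditioned(x, m, v)
x_b = step_sign_like(x, m, v)
same = all(math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
           for a, b in zip(x_a, x_b))
print(same)
```

In the sign-like form, the factor `abs(mi) / (math.sqrt(vi) + eps)` plays the role of a per-coordinate step-size multiplier applied to a sign direction, which is the view the abstract takes for the analysis.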
Primary Area: optimization
Submission Number: 8153