AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

Wenjie Li; Zhaoyang Zhang; Xinjiang Wang; Ping Luo

25 Sept 2019 (modified: 22 Jun 2025) | ICLR 2020 Conference Blind Submission | Readers: Everyone

Keywords: Optimization Algorithm, Machine Learning, Deep Learning, Adam

TL;DR: A novel adaptive algorithm with extraordinary performance in deep learning tasks.

Abstract: Adaptive optimization algorithms such as RMSProp and Adam converge quickly and have a smooth learning process. Despite their successes, they have been proven not to converge even in some convex optimization problems, and they perform worse than first-order gradient methods such as stochastic gradient descent (SGD). Several other algorithms, for example AMSGrad and AdaShift, have been proposed to alleviate these issues, but only minor improvements have been observed. This paper further analyzes the performance of such algorithms in a non-convex setting by extending their non-convergence issue to a simple non-convex case, and shows that the design of Adam's update steps could lead the algorithm to local minima. To address these problems, we propose a novel adaptive gradient descent algorithm, named AdaX, which accumulates long-term past gradient information exponentially. We prove the convergence of AdaX in both convex and non-convex settings. Extensive experiments show that AdaX outperforms Adam in various computer vision and natural language processing tasks and can catch up with SGD.
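The abstract does not state the update rule, so the following is only a minimal sketch of what "accumulating long-term past gradient information exponentially" could look like for a single scalar parameter. The recurrence `v = (1 + beta2) * v + beta2 * grad**2`, its normalization by `(1 + beta2)**t - 1`, the function names `adax_like_step` and `minimize_quadratic`, and all hyperparameter values are assumptions made for illustration, not details taken from this page.

```python
import math


def adax_like_step(x, grad, m, v, t, lr=0.005, beta1=0.9, beta2=1e-4, eps=1e-12):
    """One scalar update with an exponentially growing second-moment accumulator.

    Assumed recurrence (not given in the abstract): past squared gradients are
    scaled UP by (1 + beta2) at every step instead of being decayed as in Adam,
    so early gradient information is never forgotten.
    """
    # First moment: the usual exponential moving average of gradients.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: accumulator that grows exponentially with t.
    v = (1.0 + beta2) * v + beta2 * grad * grad
    # Normalize by the sum of the accumulation weights up to step t (t >= 1).
    v_hat = v / ((1.0 + beta2) ** t - 1.0)
    x = x - lr * m / (math.sqrt(v_hat) + eps)
    return x, m, v


def minimize_quadratic(x0=3.0, steps=2000):
    """Run the sketch on the toy objective f(x) = x^2 (hypothetical example)."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * x  # gradient of f(x) = x^2
        x, m, v = adax_like_step(x, grad, m, v, t)
    return x
```

Calling `minimize_quadratic()` moves the iterate from 3.0 toward the minimizer at 0. Because the accumulator never discounts old squared gradients, the effective step size in this sketch shrinks more conservatively than it would under a decaying average.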

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/adax-adaptive-gradient-descent-with/code)

