- Keywords: Optimization, ADAM, Stochastic Gradient Descent, Deep Learning
- TL;DR: We present a new framework for adapting Adam-typed methods, namely AdamT, to include the trend information when updating the parameters with the adaptive step size and gradients.
- Abstract: Adam-typed optimizers, as a class of adaptive moment estimation methods with the exponential moving average scheme, have been successfully used in many applications of deep learning. Such methods are appealing for capability on large-scale sparse datasets. On top of that, they are computationally efficient and insensitive to the hyper-parameter settings. In this paper, we present a new framework for adapting Adam-typed methods, namely AdamT. Instead of applying a simple exponential weighted average, AdamT also includes the trend information when updating the parameters with the adaptive step size and gradients. The newly added term is expected to efficiently capture the non-horizontal moving patterns on the cost surface, and thus converge more rapidly. We show empirically the importance of the trend component, where AdamT outperforms the conventional Adam method constantly in both convex and non-convex settings.
- Code: https://github.com/xuebin-zh/AdamT
- Original Pdf: pdf