AdaS: Adaptive Scheduling of Stochastic Gradients

28 Sept 2020 (modified: 22 Oct 2023) · ICLR 2021 Conference Withdrawn Submission
Keywords: Adaptive Stochastic Optimization, Deep Convolutional Neural Network, Low-Rank Factorization
Abstract: The choice of learning rate has been explored in many stochastic optimization frameworks to adaptively tune the step size of gradients in the iterative training of deep neural networks. While adaptive optimizers (e.g. AdaM, AdaGrad, RMSProp, AdaBound) offer fast convergence, they exhibit poor generalization characteristics. To achieve better performance, manual scheduling of learning rates (e.g. step decay, cyclical learning, warmup) is often used, but it requires expert domain knowledge. Such scheduling provides limited insight into the nature of the update rules, and recent studies show that different generalization characteristics are observed under different experimental setups. In this paper, rather than relying on raw statistical measurements of gradients (as many adaptive optimizers do), we explore the useful information carried between gradient updates. We measure the energy norm of the low-rank factorization of convolution weights in a convolutional neural network to define two probing metrics: knowledge gain and mapping condition. By means of these metrics, we provide empirical insight into the different generalization characteristics of adaptive optimizers. Further, we propose a new optimizer, AdaS, which adaptively regulates the learning rate by tracking the rate of change in knowledge gain. Experiments across several setups reveal that AdaS exhibits faster convergence and superior generalization over existing adaptive learning methods.
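To make the abstract's mechanism concrete, below is a minimal PyTorch sketch, not the authors' reference implementation. The names `knowledge_gain` and `adas_step_size`, the energy threshold, and the constants `beta` and `zeta` are illustrative assumptions: knowledge gain is approximated here as the spectral energy retained by a low-rank factorization of the unfolded convolution weights, and the learning-rate rule simply tracks its epoch-to-epoch change, mirroring the idea described above rather than reproducing the paper's exact update.

```python
import torch


def knowledge_gain(conv_weight: torch.Tensor, energy_threshold: float = 0.9) -> float:
    """Approximate a layer's knowledge gain from a low-rank view of its weights."""
    # Unfold (out_channels, in_channels, kH, kW) into a 2-D matrix.
    w = conv_weight.detach().reshape(conv_weight.shape[0], -1)
    s = torch.linalg.svdvals(w)               # singular values, descending order
    energy = s.pow(2)
    cum = torch.cumsum(energy, dim=0) / energy.sum()
    # Smallest rank whose retained spectral energy exceeds the threshold.
    rank = min(int((cum < energy_threshold).sum().item()) + 1, s.numel())
    # Fraction of spectral energy captured by the retained low-rank factorization.
    return (energy[:rank].sum() / energy.sum()).item()


def adas_step_size(prev_lr: float, gain_prev: float, gain_curr: float,
                   beta: float = 0.8, zeta: float = 1.0) -> float:
    """Illustrative rule (not necessarily the paper's exact update):
    scale the learning rate by the rate of change in knowledge gain."""
    delta = max(gain_curr - gain_prev, 0.0)
    return beta * prev_lr + zeta * delta * prev_lr


if __name__ == "__main__":
    conv = torch.nn.Conv2d(16, 32, kernel_size=3)
    g0 = knowledge_gain(conv.weight)
    # ... one epoch of SGD updates to conv.weight would happen here ...
    g1 = knowledge_gain(conv.weight)
    print(adas_step_size(0.01, g0, g1))
```

In this sketch the per-layer learning rate decays geometrically when the layer's low-rank energy stops changing and is boosted when the layer is still acquiring structure, which is one way to read "tracking the rate of change in knowledge gain."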
One-sentence Summary: A new adaptive stochastic optimizer, AdaS, is proposed for training deep convolutional neural networks; it exhibits faster convergence than existing adaptive methods while maintaining the generalization ability of SGD.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/arxiv:2006.06587/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=dQzr1dNLkD