Learning Rate Re-scheduling for AdaGrad in training Deep Neural Networks

16 Sept 2023 (modified: 11 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Supplementary Material: pdf
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Deep neural network, Optimization, AdaGrad, Learning rate schedule
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Adaptive learning rate optimization algorithms have greatly improved the training of Deep Neural Networks (DNNs). It has been shown that adaptive learning rate methods can significantly accelerate training and can be adopted in a wide variety of tasks. AdaGrad, the first adaptive learning rate optimizer, usually performs worse than its successors, such as Adam, RAdam, and AdaBelief. There are two main reasons: first, the step size of these later optimizers is bounded, which makes training more stable; second, they can use decoupled weight decay regularization to improve their generalization performance. For AdaGrad, in contrast, the update magnitude constantly decreases toward zero, so the weights change more and more slowly as the number of training iterations increases; this also makes decoupled weight decay regularization perform poorly with AdaGrad. We identify a common mistake in how AdaGrad is used to train DNNs. For other optimizers (e.g., Adam), the regret bound is proved with a $\frac{1}{\sqrt{T}}$ learning rate schedule, but in practice a more advanced schedule, such as a step-wise or cosine decay schedule, is typically used for training DNNs. AdaGrad, however, already contains an implicit $\frac{1}{\sqrt{T}}$ learning rate schedule, yet in practice most people directly add another learning rate schedule on top of it. Stacking these two schedules substantially degrades its performance in training DNNs. In this work, we therefore propose a Learning Rate Re-scheduling (LRR) method for AdaGrad that drops the implicit $\frac{1}{\sqrt{T}}$ schedule, which largely improves AdaGrad and makes decoupled weight decay regularization perform well. The proposed LRR method can also be applied to other AdaGrad-type algorithms (e.g., Shampoo). Comprehensive experiments demonstrate the effectiveness of the proposed LRR method. The source code will be made publicly available.
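To make the abstract's central point concrete, below is a minimal sketch (not the authors' exact LRR algorithm, whose details are in the paper) contrasting the standard AdaGrad step, whose effective step size implicitly decays roughly like $\frac{1}{\sqrt{T}}$ because the denominator is a running sum of squared gradients, with a hypothetical "re-scheduled" variant that normalizes by the mean of squared gradients instead, so that any learning rate decay comes only from an explicit, user-chosen schedule. The function names and the mean-normalization choice are illustrative assumptions.

```python
# Illustrative sketch only: a hypothetical way to remove AdaGrad's implicit
# 1/sqrt(T) decay, not necessarily the paper's LRR method.
import numpy as np

def adagrad_step(w, grad, accum, lr, eps=1e-10):
    """Standard AdaGrad: accum is the running SUM of squared gradients,
    so the effective step size shrinks roughly like 1/sqrt(T)."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

def adagrad_rescheduled_step(w, grad, accum, step, lr_scheduled, eps=1e-10):
    """Hypothetical re-scheduled variant: normalizing by the MEAN of squared
    gradients (accum / step) removes the implicit 1/sqrt(T) factor, leaving
    the decay entirely to the explicit schedule lr_scheduled
    (e.g., a cosine or step-wise decay value for this iteration)."""
    accum = accum + grad ** 2
    w = w - lr_scheduled * grad / (np.sqrt(accum / step) + eps)
    return w, accum
```

In the standard step, applying a cosine or step-wise schedule on top of `lr` stacks two decays, which is the issue the abstract describes; in the re-scheduled sketch, only `lr_scheduled` controls the decay.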
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 532