TL;DR: Adaptive gradient methods, when done right, do not incur a generalization penalty.
Abstract: A commonplace belief in the machine learning community is that using adaptive gradient methods hurts generalization. We re-examine this belief both theoretically and experimentally, in light of insights and trends from recent years. We revisit some previous oft-cited experiments and theoretical accounts in more depth, and provide a new set of experiments in larger-scale, state-of-the-art settings. We conclude that with proper tuning, the improved training performance of adaptive optimizers does not in general carry an overfitting penalty, especially in contemporary deep learning. Finally, we synthesize a "user's guide" to adaptive optimizers, including some proposed modifications to AdaGrad to mitigate some of its empirical shortcomings.
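For context, below is a minimal sketch of the standard (diagonal) AdaGrad update that the discussion of adaptive methods refers to. The NumPy implementation, the function name `adagrad_step`, and the default hyperparameters are illustrative assumptions, not the authors' code, and the paper's proposed modifications to AdaGrad are not reproduced here.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-10):
    """One standard (diagonal) AdaGrad update (illustrative sketch).

    Accumulates squared gradients per coordinate and scales each
    coordinate's step by the inverse square root of its accumulator,
    so frequently-updated coordinates receive smaller effective steps.
    """
    accum = accum + grad ** 2                      # running sum of squared gradients
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum

# Example usage on a toy quadratic objective f(w) = 0.5 * ||w||^2:
w = np.array([1.0, -2.0])
acc = np.zeros_like(w)
for _ in range(100):
    g = w                                          # gradient of the toy objective
    w, acc = adagrad_step(w, g, acc)
```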
Keywords: Adaptive Methods, AdaGrad, Generalization
Data: [ImageNet](https://paperswithcode.com/dataset/imagenet)