Revisiting the Generalization of Adaptive Gradient Methods

Naman Agarwal; Rohan Anil; Elad Hazan; Tomer Koren; Cyril Zhang

25 Sept 2019 (modified: 05 May 2023), ICLR 2020 Conference Blind Submission, Readers: Everyone

TL;DR: Adaptive gradient methods, when done right, do not incur a generalization penalty.

Abstract: A commonplace belief in the machine learning community is that using adaptive gradient methods hurts generalization. We re-examine this belief both theoretically and experimentally, in light of insights and trends from recent years. We revisit some previous oft-cited experiments and theoretical accounts in more depth, and provide a new set of experiments in larger-scale, state-of-the-art settings. We conclude that with proper tuning, the improved training performance of adaptive optimizers does not in general carry an overfitting penalty, especially in contemporary deep learning. Finally, we synthesize a "user's guide" to adaptive optimizers, including some proposed modifications to AdaGrad to mitigate some of its empirical shortcomings.

Keywords: Adaptive Methods, AdaGrad, Generalization

Data: [ImageNet](https://paperswithcode.com/dataset/imagenet)
