Adam vs. SGD: Closing the generalization gap on image classification

Published: 12 Dec 2021 · Last Modified: 12 Feb 2025 · OPT · CC BY 4.0
Abstract: Adam is an adaptive optimizer for training deep neural networks that is widely used across a variety of applications. On image classification problems, however, its generalization performance is significantly worse than that of stochastic gradient descent (SGD). Tuning several of Adam's inner hyperparameters can lift its performance and close this gap, but doing so makes Adam computationally expensive to use. In this paper, we use a new training approach based on layer-wise weight normalization (LAWN) to substantially improve Adam's performance and close the gap with SGD. LAWN also helps reduce the impact of batch size on Adam's performance. With speed intact and performance vastly improved, the Adam-LAWN combination becomes an attractive optimizer for image classification.
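The abstract does not spell out the LAWN procedure, so the following is only a minimal, hypothetical sketch of the general idea of layer-wise weight normalization wrapped around Adam: each layer's weight norm is recorded and the weights are rescaled back to that norm after every optimizer step. The model, target norms, and training loop are illustrative assumptions, not the paper's actual algorithm or experimental setup.

```python
# Hypothetical sketch: layer-wise weight-norm constraints applied after Adam steps.
# This is NOT the paper's exact LAWN method; it only illustrates the general idea.
import torch
import torch.nn as nn

def renormalize_layers(model: nn.Module, target_norms: dict) -> None:
    """Rescale each constrained weight tensor back to its recorded per-layer norm."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in target_norms:
                current = param.norm()
                if current > 0:
                    param.mul_(target_norms[name] / current)

# Toy model and standard Adam optimizer (illustrative choices).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Record each weight matrix's initial norm; these act as per-layer constraints.
target_norms = {n: p.detach().norm().item()
                for n, p in model.named_parameters() if p.dim() > 1}

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    renormalize_layers(model, target_norms)  # keep each layer's weight norm fixed
```

Under this kind of constraint, Adam only adjusts the direction of each layer's weights, which is one way layer-wise normalization can interact with an adaptive optimizer; the paper's actual LAWN training schedule may differ.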