Revisiting the Stability of Stochastic Gradient Descent: A Tightness Analysis

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: SGD, Stability, Generalization, Deep Learning
Abstract: The technique of algorithmic stability has been used to capture the generalization power of several learning models, especially those trained with stochastic gradient descent (SGD). This paper investigates the tightness of the algorithmic stability bounds for SGD given by Hardt et al. (2016). We show that the analysis of Hardt et al. (2016) is tight for convex objective functions, but loose for non-convex objective functions. In the non-convex case we provide a tighter upper bound on the stability (and hence the generalization error), and provide evidence that it is asymptotically tight up to a constant factor. However, deep neural networks trained with SGD exhibit much better stability and generalization in practice than these (tight) bounds suggest, namely linear or exponential degradation over time for SGD with a constant step size. We aim to characterize deep learning loss functions that admit good generalization guarantees despite being trained using SGD with a constant step size. We propose the Hessian Contractive (HC) condition, which specifies the contractivity of regions containing local minima in the neural network loss landscape. We provide empirical evidence that this condition holds for several loss functions, and theoretical evidence that the known tight SGD stability bounds for convex and non-convex loss functions can be circumvented by HC loss functions, thus partially explaining the generalization of deep neural networks.
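
For readers unfamiliar with the stability notion the abstract builds on, the following is a minimal sketch (not the paper's code; all hyperparameters and the logistic-regression setup are illustrative assumptions) of how SGD's uniform stability can be probed empirically, in the spirit of Hardt et al. (2016): run SGD on two datasets that differ in a single example, share the same mini-batch sampling order, and track the parameter divergence over iterations.

    # Minimal sketch: empirical stability probe for SGD (illustrative, not the paper's experiments).
    # Train on neighbouring datasets S and S' (one example replaced) with a shared sample order
    # and record the parameter divergence ||w_t - w'_t|| over time.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, steps, lr = 200, 10, 2000, 0.1          # assumed sizes and constant step size

    # Dataset S and its neighbour S' differing in the first example.
    X = rng.normal(size=(n, d))
    y = (rng.random(n) < 0.5).astype(float) * 2 - 1     # labels in {-1, +1}
    X2, y2 = X.copy(), y.copy()
    X2[0], y2[0] = rng.normal(size=d), -y[0]

    def grad(w, x, yi):
        """Gradient of the logistic loss log(1 + exp(-yi * w.x)) at one example."""
        margin = yi * (x @ w)
        return -yi * x / (1.0 + np.exp(margin))

    w, w2 = np.zeros(d), np.zeros(d)
    order = rng.integers(0, n, size=steps)              # shared sampling order for both runs
    divergence = []
    for i in order:
        w  -= lr * grad(w,  X[i],  y[i])
        w2 -= lr * grad(w2, X2[i], y2[i])
        divergence.append(np.linalg.norm(w - w2))

    print(f"final parameter divergence ||w_T - w'_T|| = {divergence[-1]:.4f}")

How this divergence grows with the number of steps T (e.g., polynomially versus linearly or exponentially, for convex versus non-convex losses) is exactly the quantity the stability bounds discussed in the abstract control.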
One-sentence Summary: This paper tightens the algorithmic stability bounds for SGD and, noting that these bounds cannot account for the generalization observed in deep learning, provides an empirically supported hypothesis to explain it.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=JrrHpx0DVK