Why does SGD prefer flat minima?: Through the lens of dynamical systems

01 Nov 2022 (modified: 05 May 2023) · MLmDS 2023 · Readers: Everyone
Keywords: Deep learning, Stochastic gradient descent, Flat minima
TL;DR: We show that stochastic gradient descent (SGD) escapes from sharp minima exponentially fast even before it reaches a stationary distribution.
Abstract: We show that stochastic gradient descent (SGD) escapes from sharp minima exponentially fast even before it reaches a stationary distribution. SGD has become the de facto standard training algorithm for a wide range of machine learning tasks, yet it remains an open question why it finds highly generalizable parameters of non-convex objectives such as neural network loss functions. "Escaping" analysis is an appealing framework for tackling this question: it measures how quickly SGD escapes from sharp minima, which tend to generalize poorly. Despite its importance, the framework has the limitation that it applies only after SGD has reached a stationary distribution, i.e., after sufficiently many updates. In this paper, we prove that SGD escapes from sharp minima exponentially fast even in the non-stationary setting. A key tool for this result is large deviation theory, a fundamental theory in dynamical systems. In particular, we find that a quantity called the "quasi-potential" is well suited to describing SGD's stochastic behavior throughout the training process.
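For readers unfamiliar with the quasi-potential, the sketch below states its standard Freidlin-Wentzell definition for a generic stochastic differential equation (SDE) approximation of SGD, $dX_t = -\nabla L(X_t)\,dt + \sqrt{\varepsilon}\,\sigma(X_t)\,dW_t$; this is the textbook large-deviation formulation, and the paper's precise discrete-time setting and assumptions may differ.

\[
S_{0,T}(\varphi) \;=\; \frac{1}{2}\int_0^T \bigl\|\dot\varphi(t) + \nabla L(\varphi(t))\bigr\|_{(\sigma\sigma^\top)^{-1}}^2 \, dt,
\qquad
V(x^\ast, x) \;=\; \inf_{T>0}\;\inf_{\varphi(0)=x^\ast,\ \varphi(T)=x} S_{0,T}(\varphi),
\]

where $S_{0,T}(\varphi)$ is the Freidlin-Wentzell action of a path $\varphi$ and the quasi-potential $V(x^\ast, x)$ is the cheapest noise-driven cost of traveling from the minimum $x^\ast$ to $x$ against the gradient flow. Large deviation theory then gives the mean exit time from the basin of attraction $D$ of $x^\ast$ as

\[
\mathbb{E}[\tau_D] \;\asymp\; \exp\!\Bigl(\tfrac{1}{\varepsilon}\,\inf_{x \in \partial D} V(x^\ast, x)\Bigr) \quad (\varepsilon \to 0),
\]

so a small quasi-potential barrier around a sharp minimum corresponds to exponentially fast escape.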