Gradient Descent Happens in a Tiny Subspace

Guy Gur-Ari; Daniel A. Roberts; Ethan Dyer

27 Sept 2018 (modified: 05 May 2023), ICLR 2019 Conference Blind Submission, Readers: Everyone

Abstract: We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (their number equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.

Keywords: Gradient Descent, Hessian, Deep Learning

TL;DR: For classification problems with k classes, we show that the gradient tends to live in a tiny, slowly evolving subspace spanned by the eigenvectors corresponding to the k largest eigenvalues of the Hessian.
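
One way to make the claim above concrete is to measure how much of the gradient's squared norm lies in the span of the top k Hessian eigenvectors. The sketch below is illustrative only: the function name `top_subspace_overlap`, the toy diagonal Hessian, and the choice of squared-norm fraction as the measure are assumptions made here, not details taken from this page. It also forms the full Hessian explicitly, which is practical only for very small models.

```python
import numpy as np


def top_subspace_overlap(grad, hessian, k):
    """Fraction of the squared gradient norm lying in the span of the
    k Hessian eigenvectors with the largest eigenvalues (hypothetical helper)."""
    grad = np.asarray(grad, dtype=float)
    hessian = np.asarray(hessian, dtype=float)
    # eigh assumes a symmetric matrix and returns eigenvalues in ascending order,
    # so the last k columns of eigvecs belong to the k largest eigenvalues.
    _, eigvecs = np.linalg.eigh(hessian)
    top = eigvecs[:, -k:]
    # Components of the gradient along each top eigenvector.
    coeffs = top.T @ grad
    return float(coeffs @ coeffs / (grad @ grad))


# Toy 4-parameter problem (made up for illustration) with k = 2 dominant
# curvature directions along the first two coordinates.
hessian = np.diag([5.0, 3.0, 0.1, 0.01])

# Gradient entirely inside the top subspace.
print(top_subspace_overlap([3.0, 4.0, 0.0, 0.0], hessian, k=2))

# Gradient split evenly between a top direction and a low-curvature direction.
print(top_subspace_overlap([1.0, 0.0, 1.0, 0.0], hessian, k=2))
```

In this toy setup the first call gives a value close to 1 and the second a value close to 0.5. A value near 1 during training would correspond to the gradient living almost entirely in the top-k subspace.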
