Does SGD really happen in tiny subspaces?

Published: 16 Jun 2024, Last Modified: 18 Jul 2024
Venue: HiLD at ICML 2024 Poster
License: CC BY 4.0
Keywords: non-convex optimization, deep learning, training dynamics, SGD, Hessian, low-rank subspace
TL;DR: Deep neural networks cannot be trained within the dominant subspace, even though gradients align with this subspace along the training trajectory.
Abstract: Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies show that gradients approximately align with a low-rank eigenspace of the training loss Hessian, referred to as the dominant subspace. This paper investigates whether neural networks can be trained within this subspace. Our primary finding is that projecting the SGD update onto the dominant subspace does not reduce the training loss, suggesting the alignment between the gradient and dominant subspace is spurious. Surprisingly, excluding the dominant subspace component proves as effective as the original update. Similar observations are made for the large learning rate regime (also known as Edge of Stability) and Sharpness-Aware Minimization. We discuss the main causes and implications of this spurious alignment, shedding light on neural network training dynamics.
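The decomposition described in the abstract can be sketched on a toy problem. The snippet below is a minimal illustration, not the paper's implementation: it assumes a small quadratic loss with an explicit Hessian, and the names `dominant_projector` and `split_update` are hypothetical. It builds the projector onto the top-k Hessian eigenvectors and splits a gradient into its dominant-subspace component and the remainder. A quadratic toy is not expected to reproduce the paper's finding for neural networks; it only shows how an update is split.

```python
import numpy as np


def dominant_projector(hessian, k):
    """Orthogonal projector onto the top-k eigenvectors of a symmetric Hessian."""
    # eigh returns eigenvalues in ascending order, so the last k columns
    # of eigvecs span the dominant subspace.
    _, eigvecs = np.linalg.eigh(hessian)
    top = eigvecs[:, -k:]
    return top @ top.T


def split_update(grad, projector):
    """Split a gradient into its dominant-subspace part and the remainder."""
    dom = projector @ grad
    bulk = grad - dom
    return dom, bulk


# Toy setup (assumption): loss(w) = 0.5 * w^T H w with a random PSD Hessian.
rng = np.random.default_rng(0)
d, k = 10, 3
A = rng.standard_normal((d, d))
H = A @ A.T
w = rng.standard_normal(d)
grad = H @ w

P = dominant_projector(H, k)
dom, bulk = split_update(grad, P)

# Fraction of the gradient norm lying in the dominant subspace.
alignment = np.linalg.norm(dom) / np.linalg.norm(grad)
print(f"alignment with top-{k} subspace: {alignment:.3f}")
```

Stepping along `dom` alone corresponds to the projected update the abstract reports as ineffective for neural networks, while stepping along `bulk` corresponds to the update with the dominant component excluded.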
Student Paper: Yes
Submission Number: 13