Keywords: neural network optimization, progressive sharpening, edge of stability, adaptive gradient methods, batch normalization
TL;DR: We demonstrate the influence of samples with large, opposing features that dominate a network's output. This offers explanations for several prior observations, including the edge of stability. We analyze a two-layer linear network, reproducing the observed patterns.
Abstract: We identify a new phenomenon in network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics. In particular, it implies a conceptually new cause for progressive sharpening and the edge of stability; we also highlight connections to other concepts in optimization and generalization including grokking, simplicity bias, and Sharpness-Aware Minimization.
Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong *opposing signals*: consistent, large-magnitude features that dominate the network output throughout training and provide gradients that point in opposite directions. We describe how to identify these groups, explore what sets them apart, and carefully study their effect on the network's optimization and behavior. We complement these experiments with a mechanistic explanation on a toy example of opposing signals and a theoretical analysis of a two-layer linear network on a simple model. Our finding enables new qualitative predictions of training behavior, which we confirm experimentally. It also provides a new lens through which to study and improve modern training practices for stochastic optimization, which we highlight via a case study of Adam versus SGD.
Submission Number: 7