Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization

Published: 16 Jan 2024, Last Modified: 30 Mar 2024ICLR 2024 posterEveryoneRevisionsBibTeX
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: neural network optimization, progressive sharpening, edge of stability, adaptive gradient methods, batch normalization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We show the influence of samples w/ large, opposing features which dominate a network's output. This offers explanations for several prior observations including the EoS. We analyze a 2-layer linear net, reproducing the observed patterns.
Abstract: We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics, including a conceptually new cause for progressive sharpening and the edge of stability. We further draw connections to related phenomena including grokking and simplicity bias. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong *Opposing Signals*: consistent, large magnitude features which dominate the network output and provide gradients which point in opposite directions. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We carefully study these groups' effect on the network's optimization and behavior, and we complement this with a theoretical analysis of a two-layer linear network under a simplified model. Our finding enables new qualitative predictions of training behavior which we confirm experimentally. It also provides a new lens through which to study and improve modern training practices for stochastic optimization, which we highlight via a case study of Adam versus SGD.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Primary Area: optimization
Submission Number: 2132