Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization

NeurIPS 2023 Workshop ATTRIB Submission 22 Authors

Published: 27 Oct 2023, Last Modified: 08 Dec 2023, ATTRIB Oral
Keywords: neural network optimization, progressive sharpening, edge of stability, adaptive gradient methods, batch normalization
TL;DR: We show the influence of samples with large, opposing features that dominate a network's output. This demonstrates the relative importance of small subsets of the training data for the model's predictions at various stages of training.
Abstract: We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics and demonstrates how a small number of training points can have an unusually large effect on a network's optimization trajectory and predictions. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong \emph{opposing signals}: consistent, large-magnitude features which dominate the network output and occur in both groups with similar frequency. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We complement these experiments with a theoretical analysis of a two-layer linear network on a simple model of opposing signals. Our finding enables new qualitative predictions of behavior during and after training, which we confirm experimentally. It also provides a new lens through which to study how specific data influence the learned parameters.
Submission Number: 22
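As a rough illustration of the setup the abstract describes, the sketch below builds a toy dataset with two small outlier groups that share one large-magnitude feature but carry opposite labels, then trains a two-layer linear network with full-batch gradient descent. All constants, shapes, and the data model here are hypothetical choices for illustration, not the paper's actual experimental setup; the least-squares optimum for the shared feature's coefficient is near zero, which is one simple way to see the "careful balancing" of the opposing groups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a bulk of samples with small features, plus two small outlier
# groups that share one large-magnitude feature (column 1) but have
# opposite targets -- a crude stand-in for "opposing signals".
n_bulk, n_out, d = 200, 10, 20
X_bulk = 0.1 * rng.standard_normal((n_bulk, d))
y_bulk = np.sign(X_bulk[:, 0] + 1e-8)

X_pos = 0.1 * rng.standard_normal((n_out, d))
X_neg = 0.1 * rng.standard_normal((n_out, d))
X_pos[:, 1] = 5.0   # large shared feature, group A
X_neg[:, 1] = 5.0   # same large feature, group B
y_pos = np.ones(n_out)
y_neg = -np.ones(n_out)   # opposite labels for the same dominant feature

X = np.vstack([X_bulk, X_pos, X_neg])
y = np.concatenate([y_bulk, y_pos, y_neg])

# Two-layer linear network f(x) = x @ W1 @ w2, full-batch GD on MSE.
h = 16
W1 = 0.1 * rng.standard_normal((d, h))
w2 = 0.1 * rng.standard_normal(h)
lr, losses = 0.01, []
for _ in range(200):
    pred = X @ W1 @ w2
    err = pred - y
    losses.append(float(np.mean(err ** 2)))
    g_pred = 2 * err / len(y)                # dL/dpred
    gW1 = X.T @ np.outer(g_pred, w2)         # chain rule through layer 2
    gw2 = (X @ W1).T @ g_pred
    W1 -= lr * gW1
    w2 -= lr * gw2

# Per-group losses show how well the network can serve both outlier groups
# at once: the shared feature cannot help both, so its effect is balanced out.
loss_pos = float(np.mean((X_pos @ W1 @ w2 - y_pos) ** 2))
loss_neg = float(np.mean((X_neg @ W1 @ w2 - y_neg) ** 2))
```

With a small enough learning rate this toy run sits in the stable regime; the abstract's loss spikes and oscillations arise only once sharpening pushes training toward the edge of stability, which this sketch does not attempt to reproduce.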