Gradient descent induces alignment between weights and the pre-activation tangents for deep non-linear networks
Keywords: Theory of deep learning, random matrix theory, feature learning, average gradient outer product
TL;DR: We clarify the correlation between the weight covariance and the average gradient outer product, and explain its emergence at early training times.
Abstract: Understanding the mechanisms through which neural networks extract statistics from input-label pairs is one of the most important unsolved problems in supervised learning. Prior works have identified that the Gram matrices of the weights in trained neural networks of general architectures are proportional to the average gradient outer product of the model, a statement known as the Neural Feature Ansatz (NFA). However, the reason these quantities become correlated during training is poorly understood. In this work, we clarify the nature of this correlation and explain its emergence at early training times. We identify that the NFA is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent kernel. We identify a centering of the NFA that isolates this alignment and is robust to initialization scale. We show that, through this centering, the speed of NFA development can be predicted analytically in terms of simple statistics of the inputs and labels.
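To make the quantity in the abstract concrete, below is a minimal sketch (not the authors' code) of the Neural Feature Ansatz check for the first layer of a small MLP: it compares the weight Gram matrix W^T W against the average gradient outer product (AGOP) of the model with respect to that layer's input. The toy data, layer sizes, training setup, and the use of cosine similarity as the alignment measure are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data (illustrative assumption).
d_in, d_hidden, n = 10, 32, 256
X = torch.randn(n, d_in)
y = (X[:, 0] * X[:, 1]).unsqueeze(1)  # simple nonlinear target

# Two-layer ReLU network without biases.
layer1 = nn.Linear(d_in, d_hidden, bias=False)
layer2 = nn.Linear(d_hidden, 1, bias=False)
act = nn.ReLU()

def model(x):
    return layer2(act(layer1(x)))

# Train briefly with SGD so that features can emerge.
opt = torch.optim.SGD(list(layer1.parameters()) + list(layer2.parameters()), lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = ((model(X) - y) ** 2).mean()
    loss.backward()
    opt.step()

# AGOP of the (scalar) model output with respect to the first layer's input:
# (1/n) * sum_i grad f(x_i) grad f(x_i)^T.
X_req = X.clone().requires_grad_(True)
out = model(X_req)
grads = torch.autograd.grad(out.sum(), X_req)[0]  # (n, d_in); valid because samples are independent
agop = grads.T @ grads / n                        # (d_in, d_in)

# Weight Gram matrix of the first layer.
W = layer1.weight.detach()                        # (d_hidden, d_in)
gram = W.T @ W                                    # (d_in, d_in)

# Alignment measured as cosine similarity between the flattened matrices
# (a common proxy for the NFA correlation; assumed here, not taken from the paper).
cos = torch.sum(gram * agop) / (gram.norm() * agop.norm())
print(f"NFA correlation (cosine similarity): {cos.item():.3f}")
```

After training, the printed cosine similarity is typically much closer to 1 than at initialization, which is the correlation the abstract refers to; the paper's contribution concerns why and how fast this alignment emerges.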
Student Paper: Yes
Submission Number: 6