Understanding Sparse Feature Updates in Deep Networks using Iterative Linearisation

TMLR Paper 4749 Authors

28 Apr 2025 (modified: 30 Jul 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Larger and deeper neural networks generalise well despite their increased capacity to overfit the data. Understanding why this happens is theoretically and practically important. A recent line of work investigates infinitely wide limits of neural networks through their corresponding Neural Tangent Kernels (NTKs), showing that training in this limit is equivalent to kernel regression with a fixed kernel derived from the network's architecture and initialisation. However, this "lazy training" regime cannot explain feature learning: it corresponds to linearised training in weight space, so the NTK remains constant throughout training and no features are learned. In practice, the empirical NTK of a finite network can change substantially, particularly during the initial phase of stochastic gradient descent (SGD), highlighting the importance of feature learning. In this work, we derive iterative linearisation, an interpolation between SGD and NTK-based kernel regression, which enables us to precisely quantify the frequency of feature learning and is shown to be equivalent to NTK regression under specific conditions. Empirically, only a surprisingly small amount of feature learning is required to match the performance of SGD, yet disabling feature learning entirely harms generalisation. We further justify the validity of iterative linearisation by showing that, with large periodicity, it is a special variant of the Gauss-Newton optimisation algorithm, and we use this connection to provide novel insights into the role of damping in feature learning and generalisation for Gauss-Newton.
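
The sketch below is a rough JAX illustration of the iterative-linearisation idea described in the abstract, under assumptions the abstract does not state: a toy MLP, squared loss, and full-batch gradient descent. All names (f, linearised_f, refresh_period, train) are illustrative and do not come from the paper's code. Setting refresh_period to 1 recovers ordinary gradient descent on the nonlinear network, while never refreshing the linearisation point corresponds to training the fixed-kernel (lazy/NTK-style) model.

```python
# Hypothetical sketch of iterative linearisation (not the paper's code):
# train a first-order Taylor expansion of the network around an anchor
# point, and re-linearise (i.e. update the features) every `refresh_period`
# steps. Assumes squared loss and full-batch gradient descent.
import jax
import jax.numpy as jnp


def f(params, x):
    # Toy two-layer MLP with scalar output.
    h = jnp.tanh(x @ params["W1"] + params["b1"])
    return (h @ params["W2"] + params["b2"]).squeeze(-1)


def linearised_f(params, anchor, x):
    # First-order Taylor expansion of f around `anchor`:
    # f_lin(params) = f(anchor) + J(anchor) @ (params - anchor).
    delta = jax.tree_util.tree_map(lambda p, a: p - a, params, anchor)
    f0, jvp_out = jax.jvp(lambda p: f(p, x), (anchor,), (delta,))
    return f0 + jvp_out


def loss(params, anchor, x, y):
    return jnp.mean((linearised_f(params, anchor, x) - y) ** 2)


@jax.jit
def step(params, anchor, x, y, lr=1e-2):
    grads = jax.grad(loss)(params, anchor, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)


def train(params, x, y, num_steps=1000, refresh_period=100):
    # refresh_period = 1          -> plain gradient descent on the nonlinear model
    # refresh_period >= num_steps -> anchor stays at initialisation (lazy regime)
    anchor = params
    for t in range(num_steps):
        if t % refresh_period == 0:
            anchor = params  # re-linearise: the only point where features change
        params = step(params, anchor, x, y)
    return params


# Example usage on synthetic data.
key = jax.random.PRNGKey(0)
k_x, k_w1, k_w2 = jax.random.split(key, 3)
x = jax.random.normal(k_x, (64, 4))
y = jnp.sin(x[:, 0])
params = {
    "W1": 0.3 * jax.random.normal(k_w1, (4, 16)),
    "b1": jnp.zeros(16),
    "W2": 0.3 * jax.random.normal(k_w2, (16, 1)),
    "b2": jnp.zeros(1),
}
params = train(params, x, y)
```

With refresh_period = 1 the gradient of the linearised loss at the anchor coincides with the gradient of the true loss, so this recovers standard gradient descent; larger periods freeze the features between refreshes, which is how the sketch interpolates between the two regimes.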
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Marco_Mondelli1
Submission Number: 4749