TL;DR: This work presents a multi-scale adaptive theory of feature learning in neural networks, bridging kernel rescaling and kernel adaptation across scaling regimes and providing insight into directional feature-learning effects.
Abstract: Feature learning in neural networks is crucial for their expressive power and inductive biases, motivating various theoretical approaches. Some approaches describe network behavior after training through a change in kernel scale from initialization, resulting in generalization performance comparable to that of a Gaussian process. In other approaches, by contrast, training results in the adaptation of the kernel to the data, involving directional changes to the kernel. The relationship and respective strengths of these two views have so far remained unresolved. This work presents a theoretical framework of multi-scale adaptive feature learning that bridges these two views. Using methods from statistical mechanics, we derive analytical expressions for the network-output statistics that are valid across scaling regimes and in the continuum between them. A systematic expansion of the network's probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output in the special case of a linear network. For both linear and non-linear networks, however, the multi-scale adaptive approach captures directional feature-learning effects, providing richer insight than could be recovered from a rescaling of the kernel alone.
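To make the two views concrete, here is a minimal schematic in standard Gaussian-process notation; the symbols $\alpha$, $K^\ast$, $\Delta K$, and $\sigma^2$ are illustrative, not the paper's exact expressions. Under kernel rescaling, the trained network's mean prediction at a test input $x_*$ keeps the form of a Gaussian-process posterior mean with the initialization kernel $K$ multiplied by a single scalar $\alpha$,

$$
\bar f(x_*) \;=\; \alpha\, k(x_*)^\top \left(\alpha K + \sigma^2 I\right)^{-1} y \quad \text{(rescaling)},
$$

whereas under kernel adaptation the kernel itself changes in a data-dependent direction,

$$
\bar f(x_*) \;=\; k^\ast(x_*)^\top \left(K^\ast + \sigma^2 I\right)^{-1} y,
\qquad K^\ast = K + \Delta K, \quad \Delta K \not\propto K \quad \text{(adaptation)},
$$

where $K_{ij} = k(x_i, x_j)$ on the training inputs, $k(x_*)$ collects kernel evaluations between $x_*$ and the training set, $y$ are the training targets, and $\sigma^2 I$ is a regularizer. In these terms, the abstract's linear-network result says that, for the mean output, the adapted kernel $K^\ast$ can be replaced by an effectively rescaled $\alpha K$; the directional content of $\Delta K$ is what a pure rescaling picture misses.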
Lay Summary: Understanding how neural networks learn structure from data is important for improving their performance, making them more interpretable, and allowing practitioners to catch errors during model training. Researchers have developed different theories to explain this learning process. Some suggest that training mainly changes the amplitude of the network output by a scaling factor. Others hold that training causes the network to reshape its understanding of the data in more complex ways, adapting its behavior depending on the features of the data it sees.
These two perspectives have mostly been treated as separate, each with its own strengths. Our paper brings them together under a new theoretical framework that shows how networks adapt to the data through learning, depending on the scale of the network output. Using tools from physics, we derive mathematical formulas that describe how a network behaves depending on this output scale. Surprisingly, we find that even when a network seems to behave simply, it is still quietly learning more complex patterns in the data. Our insights help clarify how and when neural networks adapt to data.
Link To Code: https://zenodo.org/records/15480898
Primary Area: Theory->Deep Learning
Keywords: feature learning, deep learning theory, statistical field theory, lazy & rich learning
Submission Number: 12533