TL;DR: A general family of random matrix models is proposed to explore attributes that give rise to heavy-tailed spectral behavior.
Abstract: Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by the observation that many objects of interest tend to exhibit some form of heavy-tailed or power-law behavior. In particular, the prevalence of heavy-tailed spectral densities in Jacobians, Hessians, and weight matrices has led to the introduction of the concept of *heavy-tailed mechanistic universality* (HT-MU). Multiple lines of empirical evidence suggest a robust correlation between heavy-tailed metrics and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random matrix models---the *high-temperature Marchenko-Pastur (HTMP) ensemble*---to explore attributes that give rise to heavy-tailed behavior in trained neural networks. Under this model, spectral densities with power-law upper and lower tails arise through a combination of three independent factors (complex correlation structures in the data; reduced temperatures during training; and reduced eigenvector entropy), appearing as an implicit bias in the model structure, and they can be controlled with an "eigenvalue repulsion" parameter. Implications of our model for other appearances of heavy tails, including neural scaling laws, optimizer trajectories, and the five-plus-one phases of neural network training, are discussed.
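The HTMP construction itself is defined in the paper; as a rough, non-authoritative illustration of how an "eigenvalue repulsion" (inverse-temperature) parameter controls tail heaviness, one can sample the standard beta-Laguerre tridiagonal model of Dumitriu and Edelman and watch the upper tail of the spectrum fatten as the repulsion parameter beta shrinks. The function names, matrix sizes, and the Hill-estimator tail proxy below are illustrative choices and are not taken from the paper.

```python
# Minimal sketch (not the paper's HTMP parameterization): sample the
# beta-Laguerre tridiagonal ensemble, where the Dyson index beta acts as
# an eigenvalue-repulsion / inverse-temperature knob, and compare upper
# spectral tails as beta decreases.
import numpy as np

def beta_laguerre_eigs(n, m, beta, rng):
    """Eigenvalues of an n x n beta-Laguerre (Wishart-like) matrix, scaled by beta*m."""
    # Bidiagonal factor B (Dumitriu-Edelman):
    #   diagonal entries    ~ chi with df = beta*(m - i),     i = 0..n-1
    #   subdiagonal entries ~ chi with df = beta*(n - 1 - i), i = 0..n-2
    diag = np.sqrt(rng.chisquare(beta * (m - np.arange(n))))
    sub = np.sqrt(rng.chisquare(beta * (n - 1 - np.arange(n - 1))))
    B = np.diag(diag) + np.diag(sub, -1)
    # L = B B^T has the beta-Laguerre eigenvalue law; /(beta*m) matches the
    # Marchenko-Pastur scale in the classical cases beta = 1, 2, 4.
    return np.linalg.eigvalsh(B @ B.T) / (beta * m)

def hill_alpha(eigs, k=50):
    """Hill estimator of the upper-tail exponent (rough proxy for tail heaviness)."""
    x = np.sort(eigs)[::-1][: k + 1]
    return 1.0 / np.mean(np.log(x[:k] / x[k]))

rng = np.random.default_rng(0)
n, m = 1000, 2000
for beta in (1.0, 0.1, 0.01):  # smaller beta = weaker repulsion = "higher temperature"
    eigs = beta_laguerre_eigs(n, m, beta, rng)
    print(f"beta={beta:5.2f}  max eig={eigs.max():8.2f}  Hill alpha ~ {hill_alpha(eigs):.2f}")
```

At beta = 1 the largest scaled eigenvalue sits near the Marchenko-Pastur edge (about 2.9 for n/m = 0.5); as beta decreases, the empirical spectrum spreads well beyond that edge and the Hill estimate drops, i.e., the upper tail becomes heavier, which is the qualitative effect of the repulsion parameter described in the abstract.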
Lay Summary: Deep learning models often work best when certain internal patterns follow a “heavy-tailed” shape, meaning most values are small but a few are extremely large. This strange but consistent pattern, seen in things like weight and gradient matrices, has been linked to better performance, but no one fully understands why it happens. To investigate this, we introduced a new mathematical tool, the high-temperature Marchenko-Pastur (HTMP) model, that helps explain how and why heavy tails emerge during training. We found that heavy-tailed patterns naturally appear when three things are present: complex data, optimal training settings, and hidden structure in the model itself. In fact, our model can tune the extent of these heavy tails by changing only one number. This matters because it connects heavy tails to deeper principles behind learning dynamics, scaling behavior with data, and the stages of training. Our findings suggest that heavy tails aren’t just a curiosity; they may be a core reason why deep learning is so effective.
Primary Area: Theory->Learning Theory
Keywords: heavy tails, random matrix theory, Bayesian learning theory
Submission Number: 10563