Keywords: scaling laws, stochastic gradient descent, shallow neural network, multi-index model
TL;DR: We characterize the scaling laws of neural networks on anisotropic data and prove that vanilla SGD is fundamentally suboptimal. Normalized algorithms overcome this bottleneck. Experiments confirm these insights extend to general activations.
Abstract: Recent theoretical work has shown that nonlinear solvable models exhibit scaling laws in the feature-learning regime. However, these results largely rely on the assumption of isotropic inputs, and understanding how these laws extend to anisotropic data remains a central open problem. In this work, we address this gap by analyzing the learning dynamics of two-layer neural networks with quadratic activations under anisotropic Gaussian inputs. We provide a sharp characterization of online stochastic gradient descent (SGD), explicitly quantifying how the covariance spectrum influences both the scaling exponent and sample complexity. Furthermore, we establish that normalization techniques overcome the intrinsic limitations of vanilla SGD, strictly improving sample efficiency in anisotropic settings. Experiments on two-layer networks with general activation functions support our theoretical predictions, suggesting that these insights extend well beyond the quadratic model.
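To make the setting in the abstract concrete, the following is a minimal NumPy sketch of online SGD for a two-layer network with quadratic activation under anisotropic Gaussian inputs. It is an illustration under stated assumptions, not the paper's implementation: the power-law covariance spectrum, the 1/m output scaling, the teacher-student squared loss, and the choice of "normalized" update (dividing each stochastic gradient by its Frobenius norm) are all assumptions, as are the hypothetical names `power_law_spectrum`, `predict`, `loss_grad`, and `online_sgd`.

```python
import numpy as np

def power_law_spectrum(d, alpha):
    """Assumed covariance spectrum: lambda_i proportional to i^(-alpha), rescaled to sum to d."""
    lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
    return lam * d / lam.sum()

def predict(W, x):
    """Two-layer network with quadratic activation: f(x) = (1/m) * sum_j (w_j . x)^2."""
    return np.mean((W @ x) ** 2)

def loss_grad(W, x, y):
    """Gradient of the per-sample loss 0.5 * (f(x) - y)^2 with respect to W."""
    m = W.shape[0]
    residual = predict(W, x) - y
    return residual * (2.0 / m) * np.outer(W @ x, x)

def online_sgd(W0, W_star, lam, lr, steps, rng, normalize=False):
    """One pass of online SGD: each step draws a fresh sample x ~ N(0, diag(lam)).

    With normalize=True the gradient is divided by its Frobenius norm
    (one possible normalization, assumed here for illustration).
    """
    W = W0.copy()
    for _ in range(steps):
        x = np.sqrt(lam) * rng.standard_normal(lam.shape[0])
        y = predict(W_star, x)  # noiseless teacher label (assumption)
        g = loss_grad(W, x, y)
        if normalize:
            g = g / (np.linalg.norm(g) + 1e-12)
        W -= lr * g
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m = 32, 4
    lam = power_law_spectrum(d, alpha=1.0)
    W_star = rng.standard_normal((m, d)) / np.sqrt(d)
    W0 = rng.standard_normal((m, d)) / np.sqrt(d)
    W_plain = online_sgd(W0, W_star, lam, lr=1e-3, steps=2000, rng=rng)
    W_norm = online_sgd(W0, W_star, lam, lr=1e-3, steps=2000, rng=rng, normalize=True)
    print(W_plain.shape, W_norm.shape)
```

The two calls differ only in the `normalize` flag, so the same spectrum and initialization can be used to compare the vanilla and normalized updates; the step sizes and dimensions shown are arbitrary placeholders.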
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 155