Large-width asymptotics and training dynamics of $\alpha$-Stable ReLU neural networks

TMLR Paper2651 Authors

08 May 2024 (modified: 18 May 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: There is a recent literature on the large-width properties of Gaussian neural networks (NNs), namely NNs whose weights are Gaussian distributed. Two popular results are: i) the characterization of the large-width asymptotic behavior of NNs in terms of Gaussian processes; ii) the characterization of the large-width training dynamics of NNs in terms of the so-called neural tangent kernel (NTK). In this paper, we investigate the large-width asymptotics and training dynamics of $\alpha$-Stable NNs, namely NNs whose weights are distributed according to $\alpha$-Stable distributions with $\alpha\in(0,2]$. First, for $\alpha$-Stable NNs with a ReLU activation function, we show that as the NN's width goes to infinity, a suitably rescaled NN converges weakly to an $\alpha$-Stable process, which generalizes the Gaussian process. In contrast to the Gaussian setting, our result shows that the choice of the activation function affects the scaling of the NN: to obtain the infinitely wide $\alpha$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling compared with sub-linear activations. Then, we characterize the large-width training dynamics of $\alpha$-Stable ReLU-NNs in terms of a random kernel, referred to as the $\alpha$-Stable NTK, and show that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. The randomness of the $\alpha$-Stable NTK is a further difference from the Gaussian setting: in the $\alpha$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width training regime.
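
To make the scaling claim concrete, the following is a minimal simulation sketch (not taken from the paper) of a one-hidden-layer $\alpha$-Stable ReLU network. It assumes the log-corrected normalization $(n \log n)^{1/\alpha}$ suggested by the abstract's statement about ReLU versus sub-linear activations; the exact constants, the multi-layer case, and the precise statement of the limit are in the paper itself. The function names and the use of `scipy.stats.levy_stable` are illustrative choices, not the authors' code.

```python
# Illustrative sketch only: a one-hidden-layer ReLU network with i.i.d.
# symmetric alpha-Stable weights, rescaled by (n * log n)^(1/alpha).
# The normalization is an assumption inferred from the abstract; the paper
# gives the exact scaling and the deep (multi-layer) result.
import numpy as np
from scipy.stats import levy_stable

def stable_relu_nn(x, width, alpha=1.5, seed=None):
    """Output of a shallow ReLU NN with symmetric alpha-Stable weights."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    W = levy_stable.rvs(alpha, 0.0, size=(width, d), random_state=rng)  # input-to-hidden weights
    v = levy_stable.rvs(alpha, 0.0, size=width, random_state=rng)       # hidden-to-output weights
    hidden = np.maximum(W @ x, 0.0)                                     # ReLU activation
    scale = (width * np.log(width)) ** (1.0 / alpha)                    # assumed log-corrected scaling
    return float(v @ hidden) / scale

if __name__ == "__main__":
    x = np.ones(3) / np.sqrt(3.0)
    for n in (100, 1000, 5000):
        draws = np.array([stable_relu_nn(x, n, seed=s) for s in range(200)])
        # Heavy tails make sample means and variances unreliable for alpha < 2,
        # so report a robust summary: the median absolute output should
        # stabilize as the width grows, consistent with weak convergence
        # to an alpha-Stable limit at a fixed input.
        print(f"width={n:5d}  median |f(x)| = {np.median(np.abs(draws)):.3f}")
```

The median-based summary is deliberate: for $\alpha < 2$ the limiting law has infinite variance, so moment-based diagnostics would not stabilize even when the distribution of the rescaled output does.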
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Murat_A_Erdogdu1
Submission Number: 2651