Large-width asymptotics and training dynamics of $\alpha$-Stable ReLU neural networks

Published: 22 Nov 2024, Last Modified: 22 Nov 2024, Accepted by TMLR, CC BY 4.0
Abstract: Large-width asymptotic properties of neural networks (NNs) with Gaussian distributed weights have been extensively investigated in the literature, with major results characterizing their large-width asymptotic behavior in terms of Gaussian processes and their large-width training dynamics in terms of the neural tangent kernel (NTK). In this paper, we study large-width asymptotics and training dynamics of $\alpha$-Stable ReLU-NNs, namely NNs with ReLU activation function and $\alpha$-Stable distributed weights, with $\alpha\in(0,2)$. For $\alpha\in(0,2]$, $\alpha$-Stable distributions form a broad class of heavy-tailed distributions, with the special case $\alpha=2$ corresponding to the Gaussian distribution. Firstly, we show that as the NN's width goes to infinity, a rescaled $\alpha$-Stable ReLU-NN converges weakly (in distribution) to an $\alpha$-Stable process, which generalizes the Gaussian process. In contrast to the Gaussian setting, our result shows that the activation function affects the scaling of the $\alpha$-Stable NN; more precisely, in order to achieve the infinite-width $\alpha$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling relative to sub-linear activations. Secondly, we characterize the large-width training dynamics of $\alpha$-Stable ReLU-NNs in terms of an infinite-width random kernel, referred to as the $\alpha$-Stable NTK, and we show that gradient descent achieves zero training error at a linear rate, with high probability, for a sufficiently large width. Unlike the NTK arising in the Gaussian setting, the $\alpha$-Stable NTK is a random kernel; more precisely, the randomness of the $\alpha$-Stable ReLU-NN at initialization does not vanish in the large-width training dynamics.
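To make the scaling concrete, below is a minimal simulation sketch of a one-hidden-layer $\alpha$-Stable ReLU network at initialization. The $(m \log m)^{-1/\alpha}$ rescaling of the output layer is an assumed form of the logarithmic correction mentioned in the abstract, and the function name and parameters are purely illustrative; the paper specifies the exact normalization and architecture.

```python
import numpy as np
from scipy.stats import levy_stable  # symmetric alpha-Stable sampler (beta = 0)

def stable_relu_nn(X, m, alpha, seed=0):
    """One-hidden-layer ReLU network with i.i.d. symmetric alpha-Stable weights.

    X: (n, d) array of inputs; returns the n scalar network outputs.
    The (m * log m)^(-1/alpha) output rescaling is an assumed form of the
    logarithmic correction described in the abstract, used here only for
    illustration.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w1 = levy_stable.rvs(alpha, 0.0, size=(d, m), random_state=rng)  # input-to-hidden weights
    w2 = levy_stable.rvs(alpha, 0.0, size=m, random_state=rng)       # hidden-to-output weights
    hidden = np.maximum(X @ w1, 0.0)                                  # ReLU activation
    return (m * np.log(m)) ** (-1.0 / alpha) * (hidden @ w2)

# One sample path over a 1-D input grid (a constant coordinate plays the role of a bias).
xs = np.linspace(-2.0, 2.0, 200)
X = np.stack([xs, np.ones_like(xs)], axis=1)
path = stable_relu_nn(X, m=2000, alpha=1.5)
```

Qualitatively, sample paths for $\alpha$ close to 2 resemble those of a Gaussian-initialized network, while smaller $\alpha$ yields paths dominated by a few large hidden units, reflecting the heavy tails of the weight distribution.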
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The authors wish to thank the Action Editor, Professor Murat A. Erdogdu, and the three anonymous Referees for their helpful suggestions. We have revised the paper by addressing the major concerns raised by the Referees. The major changes can be summarized as follows:
1) We have rewritten the introduction of the paper following the comments of Referee KFEa. In particular, the content of the introduction has been reorganized into subsections (Related work, Large-width asymptotics, and Large-width training dynamics) in order to emphasize the two main results of our work. In addition, we have included a figure showing the sample path of the $\alpha$-Stable neural network with a ReLU activation function for different values of the stability parameter $\alpha$.
2) We have rewritten Section 2 and Section 3 of the paper following the comments of Referee yfNq. In particular, each section now opens with a preliminary summary of its content. In both sections, more details have been included, both on the proofs of our results and on the differences with respect to the Gaussian case. In particular, a sketch of the proof summarizing the main arguments has been added for each result.
3) In Section 2 we have included an explanation of the additional $\log m$ term (see Remark 2.1).
4) In Section 2 we have included a figure showing the sample path of the $\alpha$-Stable neural network with a ReLU activation function as the width varies (see Figure 2 of the revised paper).
Assigned Action Editor: ~Murat_A_Erdogdu1
Submission Number: 2651