Abstract: In this paper, we study the infinite-depth limit of finite-width residual neural networks with random Gaussian weights. With proper scaling, we show that by fixing the width and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift diffusion process. Unlike the infinite-width limit, where the pre-activations converge weakly to a Gaussian random variable, we show that the infinite-depth limit yields different distributions depending on the choice of the activation function. We document two cases where these distributions have (different) closed-form expressions. We further show an intriguing change-of-regime phenomenon of the post-activation norms when the width increases from 3 to 4. Lastly, we study the sequential infinite-depth-then-infinite-width limit and compare it with the more commonly studied infinite-width-then-infinite-depth limit.
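To make the regime described in the abstract concrete, here is a minimal simulation sketch of a fixed-width residual network with i.i.d. Gaussian weights. It is not the paper's exact construction: the 1/sqrt(depth) residual scaling, the tanh activation, and the Gaussian input are assumptions chosen only to illustrate how the depth index plays the role of a diffusion time when the width is held fixed.

```python
import numpy as np

def simulate_preactivations(width=3, depth=1000, activation=np.tanh, seed=0):
    """Sketch of a fixed-width ResNet with random Gaussian weights.

    The 1/sqrt(depth) residual scaling is an assumption used here so that,
    as depth grows, the layer index l maps to a 'time' t = l / depth and the
    trajectory behaves like a discretized diffusion.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)          # hypothetical input choice
    trajectory = [x.copy()]
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        # Residual update with assumed 1/sqrt(depth) scaling.
        x = x + activation(W @ x) / np.sqrt(depth)
        trajectory.append(x.copy())
    return np.array(trajectory)

traj = simulate_preactivations(width=4, depth=2000)
print(traj.shape)                 # (depth + 1, width)
print(np.linalg.norm(traj[-1]))   # norm of the state at 'time' t = 1
```

Rerunning this with different seeds and increasing depth gives Monte Carlo samples of the limiting distribution at fixed width, which is the quantity the paper characterizes.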
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In light of the reviewers' comments, we have added the following:
* A generalized SDE for multiple inputs (Proposition 5 in the appendix) and a discussion of the C-map.
* Theoretical results for piecewise-linear activations (Appendix K) and a discussion of the effect of the nonlinearity on the distribution of $\|X_t\|$.
* A section on the practical implications of our work (Section 5), where we highlight the main practical takeaways from our results. Notably, we discuss the stability of finite-width, large-depth networks (with empirical evaluations of the gradient norm) and the phenomenon of network collapse. We also shed light on other practical implications that might be of interest to the ML community.
Assigned Action Editor: ~Balaji_Lakshminarayanan1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 466