Abstract: The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. A very common assumption is that the pre-activations are Gaussian. Although this convenient *Gaussian hypothesis* can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental work for finite-width neural networks. Our main contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network depth, even in narrow neural networks, under the assumption that the pre-activations are independent. In the process, we discover a set of constraints that a neural network should satisfy to ensure Gaussian pre-activations. In addition, we provide a critical review of the claims of the Edge of Chaos line of work and construct a non-asymptotic Edge of Chaos analysis. We also propose a unified view on the propagation of pre-activations, encompassing the framework of several well-known initialization procedures. More generally, our work provides a principled framework for addressing the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are guaranteed to be Gaussian? Our code is available on GitHub: https://github.com/p-wol/gaussian-preact/ .
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=gXxTKxHlPT
Changes Since Last Submission: The main drawback of our initial submission was an error in one of our results, related to the assumption of independent pre-activations. In the current submission, we have removed this error, tempered some of our claims, and added an experimental evaluation of the influence of the dependence between pre-activations (Appendix B).
In addition, we have extended the family of initialization distributions and activation functions that we propose to guarantee Gaussian pre-activations. With **the new sub-family, the distribution of the pre-activations is practically Gaussian** (see Figure 7 and Figure 17), a substantial improvement over the initial one.
Finally, we have added new references and improved the writing.
Code: https://github.com/p-wol/gaussian-preact
Supplementary Material: zip
Assigned Action Editor: ~Russell_Tsuchida1
Submission Number: 3988