Abstract: The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. A very common assumption is that the pre-activations are Gaussian. Although this convenient *Gaussian hypothesis* can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental work for finite-width neural networks. Our main contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network depth, even in narrow neural networks, under the assumption that the pre-activations are independent. In the process, we discover a set of constraints that a neural network should satisfy to ensure Gaussian pre-activations. In addition, we provide a critical review of the claims of the Edge of Chaos line of work and construct a non-asymptotic Edge of Chaos analysis. We also propose a unified view on the propagation of pre-activations, encompassing the framework of several well-known initialization procedures. More generally, our work provides a principled framework for addressing the much-debated question: is it desirable to initialize the training of a neural network with pre-activations that are guaranteed to be Gaussian?
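To make the setting concrete, the snippet below is a minimal sketch of how one can empirically inspect the distribution of pre-activations at initialization in a narrow network. The width, depth, tanh activation, fan-in-scaled Gaussian initialization, and the Shapiro-Wilk normality check are all illustrative assumptions; this is not the construction proposed in the paper.

```python
# Minimal sketch: inspect pre-activation distributions at initialization
# in a narrow MLP. All architectural and statistical choices are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

width, depth, n_samples = 8, 20, 5000   # narrow network, many input samples
h = rng.standard_normal((n_samples, width))

for layer in range(depth):
    # Fan-in-scaled Gaussian weights (an assumption, not the paper's family).
    W = rng.standard_normal((width, width)) * np.sqrt(1.0 / width)
    z = h @ W.T          # pre-activations of this layer
    h = np.tanh(z)       # tanh activation, again an illustrative choice

    # Test Gaussianity of one neuron's pre-activation across input samples.
    _, p_value = stats.shapiro(z[:500, 0])
    print(f"layer {layer + 1:2d}: Shapiro-Wilk p-value = {p_value:.3f}")
```

In such a sketch, small p-values in deep layers would indicate a departure from Gaussianity at finite width, which is the phenomenon the paper's activation/initialization pairs are designed to avoid.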
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=gXxTKxHlPT
Changes Since Last Submission: The main drawback of our initial submission was an error in one of our results, related to the assumption of independent pre-activations. In the current submission, we have corrected this error, tempered some of our claims, and added an experimental evaluation of the influence of the dependence between pre-activations (Appendix B).
In addition, we have extended the family of initialization distributions and activation functions that we propose to guarantee Gaussian pre-activations. With **the new sub-family, the distribution of the pre-activations is practically Gaussian** (see Figures 7 and 17), which is a clear improvement over the initial family.
Finally, we have added new references and improved the writing.
The substantial changes are highlighted in green.
Assigned Action Editor: ~Russell_Tsuchida1
Submission Number: 3988