TL;DR: We use novel geometric-measure theoretic results to prove a lower bound on the probability that the activation matrix of a finite, shallow network with random weights has full rank, with impacts on separability and memorization at initialization.
Abstract: Deep neural networks exhibit improved training and generalization performance as the number of parameters grows well beyond the size of the training set, contradicting classical intuitions about overfitting. In order to gain a better understanding of this “benign overparameterization”, we analyze the representational capacity of a random one-hidden-layer perceptron with Gaussian weights, no bias and threshold activations.
More precisely, we investigate the following question: when does a hidden layer of dimension $n$ map $k$ input vectors with pairwise angles at least $\theta$ to a full-rank activation matrix, thereby ensuring that a simple linear classifier can perfectly fit those inputs in feature space? This question bears directly on memorization capacity at initialization. We frame it as a question about hyperplane arrangements on the unit sphere and prove new isoperimetric-like inequalities.
This allows us to derive non-trivial lower bounds on the probability that a random embedding avoids the arrangement’s zero-measure regions.
Our results show that once the hidden dimension exceeds a threshold (depending on $\theta$ and the input dimension), hidden representations are linearly independent with high probability. While the case we consider is challenging due to the sparsity of the solution space, this setting highlights crucial, underlying geometric problems and connections to related questions in spherical geometry and linear algebra.
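The setting described in the abstract can be illustrated with a minimal numerical sketch. The snippet below builds a random one-hidden-layer perceptron with Gaussian weights, no bias, and threshold activations, then checks whether the $n \times k$ activation matrix of $k$ random unit inputs has full column rank; the specific dimensions and the random-unit-input construction are illustrative choices, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 10, 200, 8  # input dimension, hidden width, number of inputs (illustrative values)

# k random unit vectors in R^d (the paper assumes pairwise angles at least theta;
# random unit vectors merely serve as a stand-in here)
X = rng.standard_normal((d, k))
X /= np.linalg.norm(X, axis=0)

# Gaussian weights, no bias, threshold (0/1 step) activations
W = rng.standard_normal((n, d))
A = (W @ X > 0).astype(float)  # n-by-k activation matrix

full_rank = np.linalg.matrix_rank(A) == k
print(full_rank)
```

When `full_rank` holds, the columns of `A` are linearly independent, so a linear classifier on the hidden features can fit any labeling of the $k$ inputs, which is the memorization property the abstract refers to.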
Code Dataset Promise: No
Signed Copyright Form: pdf
Format Confirmation: I agree that I have read and followed the formatting instructions for the camera ready version.
Submission Number: 1279