Keywords: ReLU networks, Parameter-space symmetries, Geometry of Loss Landscape, Neural Tangent Kernel
TL;DR: The empirical NTK of deep ReLU networks is rank-deficient at initialization and tracks data set complexity during training; the rank of the empirical NTK can be understood theoretically via data-dependent parameter-space symmetries.
Abstract: Mathematical properties of the neural tangent kernel (NTK) have been related, both theoretically and empirically, to the convergence of optimization algorithms and the ability of trained models to generalize. However, most existing theoretical results hold only in the infinite-width limit and only for standard data distributions. In the present work, we suggest a practical approach to investigating the NTK for finite-width networks by understanding the parameter-space symmetries of the network in the presence of finite data sets. In particular, the NTK Gram matrix associated to any finite data set can naturally be regarded as an empirical version of the NTK. Moreover, its rank agrees with the functional dimension of the data set, i.e., the number of independent parameter perturbations that affect the model's outputs on the data set. In this work, we explore the evolution of the functional dimension of deep ReLU networks during training, focusing on its relationship to data set complexity, regularization, and training dynamics. Empirically, we find that the functional dimension of deep ReLU networks: (1) tracks data set complexity, (2) increases during training until the learned function stabilizes, and (3) decreases with stronger weight decay, suggesting that gradient-based optimization algorithms are biased towards simpler functions for ReLU networks. Moreover, our experiments provide strong evidence that, contrary to conventional wisdom, the empirical NTK of deep finite-width ReLU networks is typically rank-deficient at initialization. We offer a potential theoretical explanation for this empirical phenomenon in terms of certain data-dependent hidden equivalences, emphasizing the connection between these equivalences and the geometry of the loss landscape. We also establish a theoretical upper bound on functional dimension in terms of the number of linear regions sampled by the data set.
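A minimal sketch, assuming a toy architecture, random data, and a default numerical-rank tolerance (none of which are taken from the submission), of how the empirical NTK Gram matrix and its rank (the functional dimension of the data set) can be computed for a small finite-width ReLU network in JAX:

# A minimal sketch, assuming a toy ReLU architecture and random data, of how the
# rank of the empirical NTK Gram matrix (the functional dimension of the data set)
# can be computed for a small finite-width ReLU network.
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    # He-style initialization for each (weight, bias) pair.
    params = []
    for din, dout in zip(sizes[:-1], sizes[1:]):
        key, wkey, bkey = jax.random.split(key, 3)
        W = jax.random.normal(wkey, (dout, din)) * jnp.sqrt(2.0 / din)
        b = jnp.zeros(dout)
        params.append((W, b))
    return params

def relu_net(params, x):
    # Deep fully connected ReLU network with a scalar output.
    for W, b in params[:-1]:
        x = jax.nn.relu(W @ x + b)
    W, b = params[-1]
    return (W @ x + b)[0]

sizes = [4, 16, 16, 1]                                          # assumed toy architecture
params = init_params(jax.random.PRNGKey(0), sizes)
X = jax.random.normal(jax.random.PRNGKey(1), (32, sizes[0]))    # assumed toy data set

def flat_param_grad(x):
    # Gradient of the scalar output w.r.t. all parameters, flattened into one vector.
    grads = jax.grad(relu_net)(params, x)
    return jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])

J = jax.vmap(flat_param_grad)(X)      # (num points) x (num parameters) Jacobian
K = J @ J.T                           # empirical NTK Gram matrix on the data set
rank = jnp.linalg.matrix_rank(K)      # numerical rank = functional dimension
print(f"functional dimension: {rank} (Gram matrix size: {K.shape[0]})")

In this sketch the rank is bounded above by both the number of data points and the number of parameters; the abstract's rank-deficiency claim concerns settings where the rank falls strictly below that ceiling at initialization.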
Primary Area: learning theory
Submission Number: 18498