- TL;DR: The NTK linearization is a universal approximator, even when looking arbitrarily close to initialization
- Abstract: This paper establishes rates of universal approximation for the neural tangent kernel (NTK) in the standard setting of microscopic changes to initial weights. Concretely, given a target function f, a target width m, and a target approximation error eps > 0, with high probability, moving the initial weight vectors a distance B/(eps * sqrt(m)) gives a linearized finite-width NTK which is (sqrt(eps) + B/sqrt(eps * m))^2-close to both the target function f and the shallow network which this NTK linearizes. The constant B can be independent of eps (particular cases studied here include f having a favorable Fourier transform or RKHS norm), though in the worst case it scales roughly as 1/eps^d for general continuous functions. The proof rewrites f exactly as an infinite-width linearized network whose weights are a transport mapping applied to random initialization, and then samples from this transport mapping. This proof therefore provides another perspective on the scaling behavior of the NTK: redundancy in the weights due to resampling allows weights to be scaled down. Since the approximation rates match those in the literature for shallow networks, this work implies that universal approximation is not reliant upon any behavior outside the NTK regime.
- Keywords: Neural Tangent Kernel, universal approximation, Barron, transport mapping
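
The relationship between a shallow network and its NTK linearization described in the abstract can be illustrated numerically. The following is a minimal sketch under assumptions that are not taken from the paper: a ReLU activation, Gaussian initial weights, fixed random-sign outer weights, and a random perturbation of norm `radius / sqrt(m)` applied to each weight vector. The names `shallow_relu`, `ntk_linearization`, and `demo` are hypothetical, and the sketch does not implement the transport-mapping construction used in the proof; it only shows that, for such small moves, the network and its linearization stay close.

```python
import numpy as np


def shallow_relu(x, W, a):
    """Shallow network f(x; W) = (1/sqrt(m)) * sum_j a_j * relu(<w_j, x>)."""
    m = W.shape[0]
    return (a * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)


def ntk_linearization(x, W, W0, a):
    """First-order Taylor expansion of f(x; .) around W0, evaluated at W.

    Because ReLU is positively homogeneous, the expansion simplifies to
    (1/sqrt(m)) * sum_j a_j * 1[<w0_j, x> >= 0] * <w_j, x>.
    """
    m = W0.shape[0]
    active = (W0 @ x >= 0.0).astype(float)  # activation pattern frozen at init
    return (a * active * (W @ x)).sum() / np.sqrt(m)


def demo(m=4096, d=5, radius=1.0, seed=0):
    """Compare the network and its linearization after a small weight move."""
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((m, d))        # assumed Gaussian initialization
    a = rng.choice([-1.0, 1.0], size=m)     # assumed fixed outer signs
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                  # unit-norm input

    # Move each weight vector a distance radius / sqrt(m) in a random direction.
    V = rng.standard_normal((m, d))
    V *= radius / (np.sqrt(m) * np.linalg.norm(V, axis=1, keepdims=True))
    W = W0 + V

    return shallow_relu(x, W, a), ntk_linearization(x, W, W0, a)


if __name__ == "__main__":
    net, lin = demo()
    print("network:", net, "linearization:", lin, "gap:", abs(net - lin))
```

Only neurons whose pre-activation changes sign under the perturbation contribute to the gap, which is why the gap shrinks as the per-neuron move scales like 1/sqrt(m).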