Parameter-free Statistically Consistent Interpolation: Dimension-independent Convergence Rates for Hilbert kernel regression

21 May 2021 (modified: 05 May 2023) · NeurIPS 2021 Submission · Readers: Everyone
Keywords: Statistically Consistent Interpolation, Hilbert kernel regression, Risk bounds
Abstract: Textbook statistical wisdom has held that interpolating noisy training data leads to poor generalization. However, recent work has shown that this is not true and that good generalization can be obtained with function fits that interpolate the training data. This may explain why overparameterized deep networks with zero or small training error do not necessarily overfit and can generalize well. Data interpolation schemes have been exhibited that are provably Bayes optimal in the large-sample limit and achieve the theoretical lower bounds for excess risk (Statistically Consistent Interpolation) in any dimension. These schemes are non-parametric Nadaraya-Watson style estimators with singular kernels, which are statistically consistent in any data dimension for large sample sizes. The recently proposed weighted interpolating nearest neighbors scheme (wiNN) is in this class, as is the previously studied Hilbert kernel interpolation scheme. In the Hilbert scheme, the regression function estimator for a set of labelled data pairs $(x_i,y_i)\in \mathbb{R}^d\times\mathbb{R},~i=0,\dots,n$, has the form $\hat{f}(x)=\sum_i y_i w_i(x)$, where $w_i(x)= \|x-x_i\|^{-d}/\sum_j \|x-x_j\|^{-d}$. This interpolating estimator is unique in being entirely free of parameters and requires no bandwidth selection. While statistical consistency was previously proven for this scheme, the precise convergence rates of the finite-sample risk were not established. Here, we carry out a comprehensive study of the asymptotic finite-sample behavior of the Hilbert kernel regression scheme and prove a number of relevant theorems. We prove, under broad conditions, that the excess risk of the Hilbert regression estimator is asymptotically equivalent pointwise to $\sigma^2(x)/\ln(n)$, where $\sigma^2(x)$ is the noise variance. We also show that the excess risk of the plugin classifier is upper bounded by $2|f(x)-1/2|^{1-\alpha}\,(1+\varepsilon)^\alpha \sigma^\alpha(x)(\ln(n))^{-\frac{\alpha}{2}}$, for any $0<\alpha<1$, where $f$ is the regression function $x\mapsto\mathbb{E}[y|x]$. Our proofs proceed by deriving asymptotic equivalents of the moments of the weight functions $w_i(x)$ for large $n$; for instance, for $\beta>1$, $\mathbb{E}[w_i^{\beta}(x)]\sim_{n\rightarrow \infty}((\beta-1)n\ln(n))^{-1}$. We further derive an asymptotic equivalent of the Lagrange function and explicitly exhibit the nontrivial extrapolation properties of this estimator. Notably, the convergence rates are independent of the data dimension, and the excess risk is dominated by the noise variance. The bias term, for which we also give precise asymptotic estimates, is always subleading when the density of the data at the considered point is strictly positive. If this local density is zero, we show that the bias term does not vanish in the limit of a large data set, and we compute its limit explicitly. Finally, we present heuristic arguments and numerical evidence for a universal $w^{-2}$ power-law behavior of the probability density of the weights in the large-$n$ limit.
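
Illustrative code: to make the estimator concrete, below is a minimal Python/NumPy sketch of the parameter-free Hilbert kernel estimator $\hat{f}(x)=\sum_i y_i w_i(x)$ and the associated plug-in classifier described in the abstract. This is not the authors' code; the function names, the brute-force distance computation, and the toy usage example are illustrative assumptions.

    # Minimal sketch (assumed implementation, not the authors' code) of the
    # Hilbert kernel interpolating estimator: f_hat(x) = sum_i y_i w_i(x),
    # with singular weights w_i(x) = ||x - x_i||^{-d} / sum_j ||x - x_j||^{-d}.
    import numpy as np

    def hilbert_kernel_predict(X_train, y_train, X_query):
        """Hilbert kernel (singular Nadaraya-Watson) regression estimate.

        X_train: (n, d) covariates, y_train: (n,) labels, X_query: (m, d) queries.
        Returns an (m,) array of predictions.
        """
        X_train = np.asarray(X_train, dtype=float)
        y_train = np.asarray(y_train, dtype=float)
        X_query = np.atleast_2d(np.asarray(X_query, dtype=float))
        n, d = X_train.shape
        preds = np.empty(len(X_query))
        for k, x in enumerate(X_query):
            dist = np.linalg.norm(X_train - x, axis=1)
            exact = dist == 0.0
            if exact.any():
                # Interpolation: at a training point the estimate equals its label.
                preds[k] = y_train[exact][0]
                continue
            # Rescale by the minimum distance before taking the -d power, so the
            # normalized weights are computed without overflow; the scale cancels.
            w = (dist.min() / dist) ** d
            w /= w.sum()                      # normalized weights w_i(x)
            preds[k] = np.dot(w, y_train)
        return preds

    def hilbert_plugin_classify(X_train, y_train, X_query):
        # Plug-in classifier for binary labels y in {0, 1}: predict 1 if f_hat(x) >= 1/2.
        return (hilbert_kernel_predict(X_train, y_train, X_query) >= 0.5).astype(int)

    # Example usage on synthetic data (illustrative): y = sin(2*pi*x_1) + noise.
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(2000, 2))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(2000)
    print(hilbert_kernel_predict(X, y, X[:3]))   # reproduces y[:3] exactly (interpolation)
    print(hilbert_kernel_predict(X, y, [[0.5, 0.5]]))

Because the kernel $\|x-x_i\|^{-d}$ is singular at the data points, the fit interpolates the training labels exactly, yet there is no bandwidth or other tuning parameter; per the rates stated above, the pointwise excess regression risk decays like $\sigma^2(x)/\ln(n)$, slowly but independently of the dimension $d$.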
Supplementary Material: pdf
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.