SVD Provably Denoises Nearest Neighbor Data

ICLR 2026 Conference Submission16082 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: nearest neighbor, planted models
Abstract: We study the Nearest Neighbor Search (NNS) problem in a high-dimensional setting where data originates from a low-dimensional subspace and is corrupted by Gaussian noise. Specifically, we consider a semi-random model where $n$ points from an unknown $k$-dimensional subspace of $\mathbb{R}^d$ ($k \ll d$) are perturbed by zero-mean $d$-dimensional Gaussian noise with variance $\sigma^2$ on each coordinate. Without loss of generality, we may assume the nearest neighbor is at distance $1$ from the query, and that all other points are at distance at least $1+\varepsilon$. We assume we are given only the noisy data and are required to find NN of the uncorrupted data. We prove the following results: 1. For $\sigma \in O(1/k^{1/4})$, we show that simply performing SVD denoises the data; namely, we provably recover accurate NN of uncorrupted data (Theorem 1.1). 2. For $\sigma \gg 1/k^{1/4}$, NN in uncorrupted data is not even {\bf identifiable} from the noisy data in general. This is a matching lower bound on $\sigma$ with the above result, demonstrating the necessity of this threshold for NNS (Lemma 3.1). 3. For $\sigma \gg 1/\sqrt k$, the noise magnitude ($\sigma \sqrt{d}$) is significantly exceeds the inter-point distances in the unperturbed data. Moreover, NN in noisy data is different from NN in the uncorrupted data in general. \end{enumerate} Note that (1) and (3) together imply SVD identifies correct NN in uncorrupted data even in a regime where it is different from NN in noisy data. This was not the case in existing literature (see e.g. (Abdullah et al., 2014)). Another comparison with (Abdullah et al., 2014) is that it requires $\sigma$ to be at least an inverse polynomial in the ambient dimension $d$. The proof of (1) above uses upper bounds on perturbations of singular spaces of matrices as well as concentration and spherical symmetry of Gaussians. We thus give theoretical justification for the performance of spectral methods in practice. We also provide empirical results on real datasets to corroborate our findings.
Primary Area: optimization
Submission Number: 16082