TL;DR: We show that manifold spectral data can be recovered from noisy samples up to a noise-dependent threshold, and we investigate the performance of other unsupervised learning algorithms in this setting.
Abstract: In this paper, we clarify the effect of noise on common spectrally
motivated algorithms such as Diffusion Maps (DM) for dimension
reduction. Empirically, these methods are much more robust to noise
than current work suggests. Specifically, existing consistency results
require that either the noise amplitude or the dimensionality vary
with the sample size $n$. We provide new theoretical results
demonstrating that low-frequency eigenpairs reliably capture the
geometry of the underlying manifold under a constant noise level, up to a
dimension-independent threshold $O(r^{-2})$, where $r$ is the noise
amplitude. Our results rely on a decomposition of the manifold Laplacian in
the Sasaki metric, a technique that, to our knowledge, has not previously
been used in this area.
We experimentally validate our theoretical predictions. Additionally, we observe
similarly robust behavior for other manifold learning algorithms that are
not based on computing the Laplacian, namely Local Tangent Space Alignment
(LTSA) and variational autoencoders (VAEs).
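To make the claim concrete, below is a minimal, self-contained sketch of the setting (an illustrative toy example, not the paper's experimental setup; the sample size, ambient dimension, noise amplitude, and kernel bandwidth are all ad hoc choices): Diffusion Maps applied to a circle embedded in a high-dimensional ambient space, with noise of constant amplitude $r$ that does not shrink as $n$ grows. The low-frequency eigenvectors should still trace out the circle.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, r = 1000, 50, 0.1              # samples, ambient dimension, noise amplitude

# Latent circle embedded in R^D, plus Gaussian noise whose total amplitude
# stays fixed at r as n grows -- the constant-noise regime discussed above.
theta = rng.uniform(0.0, 2.0 * np.pi, n)
X = np.zeros((n, D))
X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)
X += (r / np.sqrt(D)) * rng.standard_normal((n, D))

# Gaussian affinities with a hand-picked bandwidth for this toy example.
sq = (X * X).sum(1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
K = np.exp(-d2 / 0.1)

# Density normalization (alpha = 1), then the symmetrized diffusion operator.
q = K.sum(1)
K = K / np.outer(q, q)
dK = K.sum(1)
A = K / np.sqrt(np.outer(dK, dK))
w, V = np.linalg.eigh(A)
psi = V[:, ::-1] / np.sqrt(dK)[:, None]   # eigenvectors of the Markov matrix

# The first nontrivial pair (psi_1, psi_2) should trace out a circle, i.e.
# approximate (cos theta, sin theta) up to rotation and scale, so the radius
# of the 2D embedding should be nearly constant despite the noise.
rad = np.linalg.norm(psi[:, 1:3], axis=1)
print("relative radius spread of the 2D embedding:", rad.std() / rad.mean())
```

A small printed spread indicates that the two leading nontrivial eigenvectors recover the circle's geometry from the noisy samples.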
Lay Summary: Dimension reduction is the task of reducing the complexity of data while maintaining the integrity of its essential information. Diffusion Maps is representative of one approach to this problem, in which the geometric information of a dataset is captured by an embedding derived from a fundamental differential operator. The typical setting for these studies is manifold-valued data, i.e., data with a low-dimensional latent structure, and it has been shown that this structure is essentially preserved by the procedure. In practice, however, even data with low-dimensional structure likely carries some level of noise, and in this work we show what recovery is possible in that setting.
Our key result is that, up to a noise-dependent threshold, intrinsic manifold information can be approximated with small error by methods like Diffusion Maps; beyond this point, any larger embedding captures mostly extraneous information arising from the noise. We make this precise by leveraging a particular geometric description of noisy manifold data, given by the Sasaki metric, and we show that it allows for a clean disentanglement of the noise and manifold components. These results carry over to real data via a perturbation argument.
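As a schematic of this disentanglement (our simplified notation; the precise statement involves the full Sasaki geometry together with the perturbation argument just mentioned), one can think of the Laplacian as splitting into a manifold part and a noise-fiber part, so that the joint spectrum is additive:

```latex
% Schematic splitting under a Sasaki-type decomposition (illustrative only):
\[
  \Delta \;=\; \Delta_{\mathcal{M}} \;+\; \Delta_{\mathrm{fiber}},
  \qquad
  \lambda_{i,j} \;=\; \lambda_i^{\mathcal{M}} \;+\; \mu_j^{\mathrm{fiber}}.
\]
% A noise fiber of diameter O(r) has its first nonzero eigenvalue at scale
% r^{-2}, so every eigenpair below the O(r^{-2}) threshold is a pure
% manifold eigenpair; beyond it, noise modes begin to enter the spectrum.
```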
This culminates in a new perspective on why spectral embeddings such as Diffusion Maps are effective. While previous work emphasizes that recovery of manifold geometry is possible for truly low-dimensional data, we show that for very high-dimensional data contaminated by noise, low-dimensional structure can still be revealed by these procedures. In this sense, the data is implicitly denoised, providing valuable insight into its underlying structure. We observe similar behavior for LTSA and VAEs, indicating that this phenomenon is not specific to Laplacian-based methods.
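For LTSA, a minimal analogous check can be run with scikit-learn's LocallyLinearEmbedding using method="ltsa" (again a hypothetical toy sketch on a noisy curve, not the paper's experiment; a VAE version would additionally require a training loop and is omitted here):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
n, D, r = 1000, 50, 0.1              # samples, ambient dimension, noise amplitude

# Open arc (so a 1D embedding exists) in R^D with constant-amplitude noise.
t = rng.uniform(0.0, np.pi, n)
X = np.zeros((n, D))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
X += (r / np.sqrt(D)) * rng.standard_normal((n, D))

# LTSA embedding into one dimension, matching the curve's intrinsic dimension.
ltsa = LocallyLinearEmbedding(n_neighbors=20, n_components=1, method="ltsa")
emb = ltsa.fit_transform(X)[:, 0]

# The 1D coordinate should recover the latent parameter up to sign and scale.
print("|corr| with latent parameter:", abs(np.corrcoef(emb, t)[0, 1]))
```

A correlation close to 1 indicates that the LTSA coordinate recovers the latent parameter despite the constant-level noise, mirroring the robustness seen for Diffusion Maps.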
Primary Area: General Machine Learning->Unsupervised and Semi-supervised Learning
Keywords: Dimension Reduction, Denoising, Diffusion Maps, Laplacian, VAE, LTSA
Submission Number: 15271