Kernel-based Robust Markov Subsampling for Regularized Nonparametric Regression with Contaminated Data
Keywords: Contaminated data; Nonparametric regression; Robust subsampling
TL;DR: Robust subsampling for nonparametric regression with contaminated data
Abstract: Large-scale data with contamination are ubiquitous in biomedicine, economics and social science, but
statistical learning from such data often suffers from computational bottlenecks and a lack of robustness.
Subsampling offers an efficient solution by sampling a representative subset of uncorrupted data from the full dataset, thereby reducing computational costs while enhancing robustness. Existing subsampling methods, such as leverage- and gradient-based approaches,
focus on parametric models and fail under nonparametric models or severe contamination.
To address these limitations, we propose a kernel-based robust Markov subsampling (KRMS) method for nonparametric regression with
contaminated data in a reproducing kernel Hilbert space (RKHS). By dynamically adjusting the Markov sampling probabilities based on
the ratio of residuals to kernel norms of predictors, our method simultaneously suppresses contaminated observations
and prioritizes informative observations, enabling robust learning from contaminated datasets. Theoretically, we establish asymptotic properties of the estimators, including consistency and asymptotic normality, as well as generalization bounds under RKHS regularization, providing the first unified framework for robust subsampling in nonparametric settings.
Simulations and real-data applications demonstrate KRMS’s superiority over existing methods,
particularly at high contamination levels. Our approach bridges a critical gap in scalable and
robust statistical learning, with broad applicability to large-scale data.
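
The sampling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual KRMS algorithm, so everything below is an assumption made for illustration: the names `poly_kernel` and `robust_markov_subsample`, the polynomial kernel, the pilot kernel ridge fit, the median scaling of the scores, and the Metropolis-style acceptance rule min(1, exp(s_current - s_candidate)). It shows only how a residual-to-kernel-norm ratio could drive a Markov chain over observation indices so that points with large scaled residuals are rarely retained.

```python
import numpy as np


def poly_kernel(A, B, degree=3):
    """Polynomial kernel k(a, b) = (1 + a.b)^degree (an illustrative choice)."""
    return (1.0 + A @ B.T) ** degree


def robust_markov_subsample(X, y, m, pilot_size=100, lam=1e-2, degree=3, seed=0):
    """Hypothetical sketch: draw m indices with a residual-driven Markov chain."""
    rng = np.random.default_rng(seed)
    n = len(y)

    # Pilot kernel ridge fit on a small uniform subsample (assumption).
    pilot = rng.choice(n, size=min(pilot_size, n), replace=False)
    K = poly_kernel(X[pilot], X[pilot], degree)
    ridge = lam * len(pilot) * np.eye(len(pilot))
    alpha = np.linalg.solve(K + ridge, y[pilot])

    # Absolute residuals of every observation under the pilot fit.
    resid = np.abs(y - poly_kernel(X, X[pilot], degree) @ alpha)

    # Kernel norm of each predictor: sqrt(k(x, x)).
    knorm = np.sqrt((1.0 + np.sum(X * X, axis=1)) ** degree)

    # Ratio of residual to kernel norm, scaled by its median (assumption).
    score = resid / knorm
    score = score / np.median(score)

    # Metropolis-style chain over indices: a uniformly proposed candidate is
    # accepted with probability min(1, exp(score[current] - score[cand])),
    # so observations with large scores are visited rarely.
    current = int(np.argmin(score))
    chosen = []
    while len(chosen) < m:
        cand = int(rng.integers(n))
        if rng.random() < np.exp(min(0.0, score[current] - score[cand])):
            current = cand
        chosen.append(current)
    return np.array(chosen)
```

The returned indices may contain repeats, since the chain records its current state at every step. Under this assumed rule, observations whose responses sit far from the pilot fit relative to their kernel norm are sampled much less often than under uniform subsampling.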
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 22884