Optimal Sub-data Selection for Nonparametric Function Estimation in Kernel Learning with Large-scale Data
Keywords: Nonparametric function estimation, kernel learning, subdata selection, large-scale data
Abstract: This paper considers estimating nonparametric functions in a reproducing kernel Hilbert space (RKHS) for kernel learning problems with large-scale data. Kernel learning with large-scale data is computationally intensive, particularly because of the high cost and complexity of tuning-parameter selection. Existing sampling methods for scalable kernel learning, such as leverage-score-based sampling and its variants, are designed to select subsamples that minimize the expected global (in-sample or out-of-sample) prediction error. Complementing these methods, this paper proposes an optimal informative sampling method for estimating nonparametric functions pointwise when the subsample size is potentially small. The method is tailored to scenarios where computational resources are limited yet accurate pointwise prediction at each test location is desired. It also pairs naturally with existing fast kernel learning algorithms, such as the Nyström method and FALKON, which rely on randomly selected sub-datasets. Theoretical studies compare the efficiency of the proposed method with that of the full-data estimator with optimally selected tuning parameters.
Numerical experiments demonstrate the statistical efficiency of the proposed method relative to existing methods based on randomly sampled data.
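For readers unfamiliar with the baseline the abstract contrasts against, below is a minimal, hypothetical sketch of ridge leverage-score subsampling followed by kernel ridge regression on the selected sub-data. It illustrates the existing class of global-error-oriented sampling methods, not the paper's proposed pointwise method; exact leverage scores are computed here only for clarity (practical variants approximate them), and all function and variable names are illustrative assumptions.

```python
# Illustrative sketch (NOT the paper's method): ridge leverage-score
# subsampling plus kernel ridge regression on the selected sub-data.
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def ridge_leverage_scores(K, lam):
    """Diagonal of K (K + n*lam*I)^{-1}, i.e., the ridge leverage scores.
    Exact computation is O(n^3); shown here for small n only."""
    n = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + n * lam * np.eye(n)))

rng = np.random.default_rng(0)
n, m, lam = 2000, 100, 1e-3            # full size, subsample size, ridge
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X)
scores = ridge_leverage_scores(K, lam)
idx = rng.choice(n, size=m, replace=False, p=scores / scores.sum())

# Fit kernel ridge regression using only the m selected points.
Ks = gaussian_kernel(X[idx], X[idx])
alpha = np.linalg.solve(Ks + m * lam * np.eye(m), y[idx])

x_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_hat = gaussian_kernel(x_test, X[idx]) @ alpha     # pointwise predictions
print(np.c_[x_test, y_hat])
```

The sampling probabilities here target global prediction error over the whole design; the paper's proposal instead selects sub-data informative for prediction at a given test location.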
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 22324