Keywords: data selection, active learning, low-rank approximation
Abstract: In the data selection problem, the objective is to choose a small, representative subset of data that can be used to efficiently train a machine learning model. Sener and Savarese [ICLR 2018] showed that given an embedding representation of the data and certain underlying geometric assumptions, $k$-center clustering heuristics can be employed to perform data selection. This notion was further explored by Axiotis et al. [ICML 2024], who proposed a data selection approach based on $k$-means clustering and sensitivity sampling. However, these approaches all assume that the datasets intrinsically exhibit certain geometric properties that can be captured by clustering, whereas a large number of datasets actually possess algebraic structure that is better utilized by low-rank approximation, feature selection, or principal component analysis. In this paper, we introduce a new data selection technique based on low-rank approximation and residual sampling. Given an embedding representation of the data under specific assumptions, which intuitively correspond to algebraic or angular notions of Lipschitzness, we give a method that selects roughly $k+\frac{1}{\varepsilon^2}$ items whose average loss approximates the average loss of the entire dataset, up to a relative $(1+\varepsilon)$ error and an additive $\varepsilon\Phi_k$ term, where $\Phi_k$ denotes the optimal rank-$k$ cost for fitting the input embedding. We complement our theoretical guarantees with empirical evaluations, showing that for a number of important real-world datasets, our data selection approach outperforms previous strategies based on uniform sampling or sensitivity sampling.
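To make the high-level idea in the abstract concrete, the following is a minimal sketch of one plausible instantiation of low-rank approximation plus residual sampling: compute a rank-$k$ approximation of the embedding matrix, measure each point's residual cost, and sample points with probabilities driven by those residuals, returning inverse-probability weights for estimating the average loss. The function name `residual_sample`, the mixing with a uniform floor, and the parameters `X`, `k`, `m` are illustrative assumptions, not the paper's stated algorithm.

```python
import numpy as np

def residual_sample(X, k, m, rng=None):
    """Illustrative sketch (not the paper's exact method): sample m points with
    probabilities proportional to their rank-k residual cost, plus a uniform floor."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # Rank-k approximation of the embedding matrix via truncated SVD.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

    # Per-point residual cost; their sum is the rank-k fitting cost Phi_k.
    res = np.linalg.norm(X - X_k, axis=1) ** 2
    phi_k = res.sum()

    # Sampling distribution: half residual-driven, half uniform, so points that
    # are well captured by the rank-k subspace can still be selected.
    p = 0.5 * res / max(phi_k, 1e-12) + 0.5 / n

    idx = rng.choice(n, size=m, replace=False, p=p)
    weights = 1.0 / (m * p[idx])  # inverse-probability weights for loss estimates
    return idx, weights
```

In this sketch, the weighted average of per-point losses over the selected indices serves as the estimate of the full dataset's average loss; the uniform-mixing constant 0.5 is an arbitrary choice for illustration.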
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 22871