Sensitivity Sampling for Coreset-Based Data Selection

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: clustering, data selection, coreset
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present a new data-selection and active-learning strategy based on sampling points w.r.t. their cost in a k-means solution.
Abstract: Given the sustained growth in both training data and model parameters, the problem of finding the most useful training data has become of primary importance for training state-of-the-art and next-generation models. We work in the context of active learning and consider the problem of finding the best representative subset of a dataset to train a machine learning model. Assuming an embedding representation of the data (coming, for example, from either a pre-trained model or a generic all-purpose embedding) and that the model loss is Lipschitz with respect to these embeddings, we provide a new active learning approach based on k-means clustering and sensitivity sampling. We prove that our new approach allows us to select a set of ``typical'' $k$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\epsilon)$ factor and an additive $\epsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input data and $\lambda$ is the Lipschitz constant. Our approach is particularly efficient since it only requires very few inferences from the model ($O(k + 1/\epsilon^2)$). We furthermore demonstrate the performance of our approach on classic datasets and show that it outperforms state-of-the-art methods.
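The abstract's core idea — cluster the embeddings with k-means, then sample points with probability proportional to their clustering cost — can be sketched as follows. This is a minimal illustration only, not the authors' exact procedure: the function name, the plain Lloyd iterations, and the choice of squared distance to the nearest center as a sensitivity proxy are all assumptions for demonstration; the paper's actual sampling distribution and guarantees are more involved.

```python
import numpy as np

def kmeans_sensitivity_sample(X, k, m, iters=10, seed=0):
    """Sample m points with probability proportional to their k-means cost.

    Hypothetical sketch of cost-based (sensitivity) sampling; not the
    paper's exact algorithm or theoretical sampling distribution.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize centers by picking k distinct input points at random.
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(iters):  # plain Lloyd iterations
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    # Sensitivity proxy: squared distance to the nearest center
    # (a point's contribution to the k-means cost Phi_k).
    cost = ((X - centers[assign]) ** 2).sum(-1)
    probs = (cost + 1e-12) / (cost + 1e-12).sum()  # guard against all-zero cost
    idx = rng.choice(n, size=m, replace=False, p=probs)
    return idx, probs[idx]

# Toy usage: two well-separated Gaussian blobs in 2D.
blob_rng = np.random.default_rng(1)
X = np.vstack([blob_rng.normal(c, 0.1, size=(50, 2)) for c in (0.0, 5.0)])
idx, p = kmeans_sensitivity_sample(X, k=2, m=5)
```

Points far from their cluster center (high cost, hence high "sensitivity") are sampled more often, which is what lets a small sample of roughly $O(k + 1/\epsilon^2)$ points approximate the average loss over the whole dataset under the Lipschitz assumption.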
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6024