Sensitivity Sampling for Coreset-Based Data Selection

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: clustering, data selection, coreset
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present a new data-selection and active-learning strategy based on sampling points w.r.t. their cost in a k-means solution.
Abstract: Given the sustained growth in both training data and model parameters, the problem of finding the most useful training data has become of primary importance for training state-of-the-art and next-generation models. We work in the context of active learning and consider the problem of finding the best representative subset of a dataset to train a machine learning model. Assuming an embedding representation of the data (coming, for example, from either a pre-trained model or a generic all-purpose embedding) and that the model loss is Lipschitz with respect to these embeddings, we provide a new active learning approach based on k-means clustering and sensitivity sampling. We prove that our new approach allows us to select a set of ``typical'' $k$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\epsilon)$ factor and an additive $\epsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input data and $\lambda$ is the Lipschitz constant. Our approach is particularly efficient since it only requires very few inferences from the model ($O(k + 1/\epsilon^2)$). We furthermore demonstrate the performance of our approach on classic datasets and show that it outperforms state-of-the-art methods.
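The abstract's core idea — cluster the embeddings with k-means, then sample points with probability proportional to their clustering cost — can be sketched as follows. This is a minimal illustration only, not the authors' exact procedure: the function name, the plain Lloyd iterations, and the choice of squared distance to the nearest center as a sensitivity proxy are all assumptions for demonstration; the paper's actual sampling distribution and guarantees are more involved.

```python
import numpy as np

def kmeans_sensitivity_sample(X, k, m, iters=10, seed=0):
    """Sample m points with probability proportional to their k-means cost.

    Hypothetical sketch of cost-based (sensitivity) sampling; not the
    paper's exact algorithm or theoretical sampling distribution.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize centers by picking k distinct input points at random.
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(iters):  # plain Lloyd iterations
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    # Sensitivity proxy: squared distance to the nearest center
    # (a point's contribution to the k-means cost Phi_k).
    cost = ((X - centers[assign]) ** 2).sum(-1)
    probs = (cost + 1e-12) / (cost + 1e-12).sum()  # guard against all-zero cost
    idx = rng.choice(n, size=m, replace=False, p=probs)
    return idx, probs[idx]

# Toy usage: two well-separated Gaussian blobs in 2D.
blob_rng = np.random.default_rng(1)
X = np.vstack([blob_rng.normal(c, 0.1, size=(50, 2)) for c in (0.0, 5.0)])
idx, p = kmeans_sensitivity_sample(X, k=2, m=5)
```

Points far from their cluster center (high cost, hence high "sensitivity") are sampled more often, which is what lets a small sample of roughly $O(k + 1/\epsilon^2)$ points approximate the average loss over the whole dataset under the Lipschitz assumption.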
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6024