Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: clustering, data selection, coreset
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present a new data selection and active learning strategy based on sampling points w.r.t. their cost in a k-means solution.
Abstract: Given the sustained growth in both training data and model
parameters, the problem of finding the most useful training data
has become of primary importance for training state-of-the-art and
next-generation models.
We work in the context of active learning and consider the problem
of finding the best representative subset of a dataset to
train a machine learning model. Assuming an embedding representation of
the data (coming, for example, from either a pre-trained model or a
generic all-purpose embedding) and that the model loss is Lipschitz
with respect to these embeddings, we provide a new active learning
approach based on k-means clustering and sensitivity sampling.
We prove that our new approach selects a set of $k$ ``typical''
elements whose average loss matches the average loss of the
whole dataset, up to a multiplicative $(1\pm\epsilon)$ factor and an additive $\epsilon \lambda \Phi_k$, where $\Phi_k$ denotes the $k$-means cost of the input data and $\lambda$ is the Lipschitz constant.
Our approach is particularly efficient since it
requires only very few model inferences ($O(k + 1/\epsilon^2)$).
We furthermore demonstrate the performance of our approach on classic
datasets and show that it outperforms state-of-the-art methods.
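For illustration, a minimal sketch of the general idea, sampling points with probability proportional to their cost in a k-means solution, might look as follows. This is an assumption-laden reading of the abstract, not the paper's exact procedure: the helper `sensitivity_sample`, the 50/50 mix of cost-proportional and uniform probabilities, and the importance weights are our choices, and scikit-learn's KMeans stands in for whatever clustering the authors use.

```python
# Hypothetical sketch of k-means sensitivity sampling for data selection.
# Not the authors' algorithm: the mixing constants and weighting are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(X, k, m, seed=0):
    """Sample m indices from embeddings X, biased toward high k-means cost."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # km.transform gives Euclidean distances to all centers; the squared
    # distance to the nearest center is the point's k-means cost.
    costs = km.transform(X).min(axis=1) ** 2
    # Mix with a uniform term so points sitting on a center keep
    # nonzero sampling probability; probabilities sum to 1.
    probs = 0.5 * costs / costs.sum() + 0.5 / len(X)
    idx = rng.choice(len(X), size=m, replace=False, p=probs)
    # Importance weights correct averages for the non-uniform sampling.
    weights = 1.0 / (len(X) * probs[idx])
    return idx, weights

# Usage: idx, w = sensitivity_sample(embeddings, k=50, m=200)
```

Under the abstract's Lipschitz assumption, the weighted average loss over such a sample tracks the full-dataset average loss up to the stated $(1\pm\epsilon)$ multiplicative and $\epsilon \lambda \Phi_k$ additive error.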
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6024