Keywords: Coreset selection, Classification
TL;DR: We formulate the problem of unlabeled coreset selection and develop a novel method that achieves state-of-the-art results across multiple datasets, generalizes to new applications, and is more efficient than prior label-based selection methods.
Abstract: Deep learning methods rely on massive data, resulting in substantial costs for storage, annotation, and model training.
Coreset selection aims to select a representative subset of the data to train models with lower cost while ideally performing on par with the full data training.
State-of-the-art coreset selection methods use carefully designed criteria to quantify the importance of each data example using ground-truth labels and dataset-specific training, then select examples whose scores lie in a certain range to construct a coreset.
These methods work well in their respective settings; however, they cannot consider candidate data that are initially unlabeled.
This limits their applicability, especially given that the majority of real-world data are unlabeled.
To that end, this paper explores the problem of coreset selection for unlabeled data.
We first motivate and formalize the problem of unlabeled coreset selection, which reduces annotation requirements to enable greater scale relative to label-based coreset selection.
We then develop an unlabeled coreset selection method, Blind Coreset Selection (BlindCS), that jointly considers overall coverage of the data distribution and the relative importance of each example based on redundancy.
Notably, BlindCS does not use any model- or dataset-specific training, which increases coreset generalization and reduces computation relative to training-based coreset selection.
We evaluate BlindCS on four datasets and confirm its advantage over several state-of-the-art methods that use labels and training, establishing a strong baseline for future research in unlabeled coreset selection.
Notably, the BlindCS coreset for ImageNet achieves higher accuracy than previous label-based coresets at a 90% prune rate, while removing annotation requirements for 1.15 million images.
We will make our code publicly available with the final paper.
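For illustration only, below is a minimal sketch of the kind of coverage-and-redundancy selection over unlabeled embeddings that the abstract describes, assuming precomputed (e.g., self-supervised) features. The function name and the greedy facility-location-style scoring rule are our own assumptions for illustration, not the paper's BlindCS algorithm.

```python
# Illustrative sketch only: greedy coverage/redundancy selection on
# unlabeled embeddings. NOT the paper's BlindCS algorithm; the scoring
# rule is a generic facility-location-style heuristic.
import numpy as np

def select_coreset(embeddings: np.ndarray, budget: int) -> list[int]:
    """Greedily pick `budget` examples that cover the embedding space
    while avoiding redundant (highly similar) picks."""
    # Normalize so dot products are cosine similarities.
    feats = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = feats @ feats.T                 # pairwise similarity, shape (n, n)
    n = sim.shape[0]
    covered = np.zeros(n)                 # best similarity of each example to the coreset so far
    selected: list[int] = []
    for _ in range(budget):
        # Coverage gain: how much each candidate would raise the
        # per-example "best similarity to the coreset"; redundant
        # candidates (already well covered) get low gain.
        gain = np.maximum(sim - covered[None, :], 0.0).sum(axis=1)
        gain[selected] = -np.inf          # never pick the same example twice
        best = int(np.argmax(gain))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected

# Usage sketch: feats = encoder(images)  # unlabeled, no dataset-specific training
# idx = select_coreset(feats, budget=int(0.1 * len(feats)))  # e.g., 90% prune rate
```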
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12923