Abstract: Data-efficient deep learning is an emerging and powerful branch of deep learning that focuses on minimizing the amount of labeled data required for training. Coreset selection is one such method: it selects a representative subset of the original dataset that achieves comparable generalization performance at much lower computation and disk-space overhead. Dataset Distillation (DD), another branch of data-efficient deep learning, achieves this goal by distilling a small synthetic dataset from the original dataset. While DD methods exploit soft labels (probabilistic target labels rather than traditional one-hot labels), which yield significant improvements over hard labels, to the best of our knowledge no such study exists for coreset selection. In this work, we study, for the first time, the impact of soft labels on the generalization accuracy of various coreset selection algorithms for the image classification task. While soft labels improve the performance of all the methods, surprisingly, random selection with soft labels performs on par with or better than existing coreset selection approaches. Our findings suggest that future coreset algorithms should benchmark against random selection with soft labels as an important baseline.
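
A minimal sketch of the baseline the abstract highlights, random selection with soft labels. It assumes (as is common in soft-label DD work, though not stated in the abstract) that soft labels are the softmax outputs of a pretrained teacher network and that the student is trained with cross-entropy against those probabilistic targets; all function names below are illustrative, not from the paper:

    # Sketch: random coreset selection + soft-label training targets (PyTorch).
    # Assumptions not in the abstract: soft labels come from a pretrained
    # teacher's (optionally temperature-scaled) softmax, and the student is
    # trained with cross-entropy against these probabilistic targets.
    import torch
    import torch.nn.functional as F

    def random_coreset_indices(num_samples: int, fraction: float, seed: int = 0) -> torch.Tensor:
        """Uniformly sample a random subset of dataset indices (the 'random' baseline)."""
        g = torch.Generator().manual_seed(seed)
        return torch.randperm(num_samples, generator=g)[: int(fraction * num_samples)]

    @torch.no_grad()
    def soft_labels(teacher: torch.nn.Module, images: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        """Probabilistic targets: the teacher's temperature-scaled softmax."""
        return F.softmax(teacher(images) / temperature, dim=1)

    def soft_label_loss(student_logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
        """Cross-entropy against soft targets (equals KL to the teacher up to a constant)."""
        return -(soft_targets * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()

Training the student only on the randomly selected indices, with soft_label_loss in place of the usual one-hot cross-entropy, yields the baseline the abstract recommends future coreset algorithms benchmark against.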
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ankit_Singh_Rawat1
Submission Number: 6750