Survey on Coresets for Deep Learning: Methods and Applications

TMLR Paper6416 Authors

07 Nov 2025 (modified: 20 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: This survey presents a comprehensive review of coreset methods in deep learning, an important tool for improving data efficiency in large-scale neural networks. In general, a ``coreset'' is a small yet representative subset of data selected to replace the full dataset, which can make training more efficient while preserving model performance. Over the past 20 years, coreset techniques have been widely applied to many classical machine learning problems, such as clustering, regression, and classification. In recent years, they have also begun to attract considerable attention in modern deep learning. However, designing an effective coreset is usually a challenging task, since one must balance the trade-offs among multiple factors, such as complexity, robustness, and accuracy. In this survey, we focus on two common scenarios for using coreset methods in deep learning: (1) reducing the extremely high computational cost of training a deep learning model, and (2) improving data utilization under resource constraints such as a limited labeling budget or storage capacity. We begin by outlining the fundamental principles, advantages, and design challenges of coresets in these two scenarios. We also discuss emerging applications of coresets in large language models. Finally, we identify several open problems and promising directions for future research.
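To make the idea of "selecting a small yet representative subset" concrete, here is a minimal sketch of one classical selection rule, k-center greedy (farthest-point) selection, which has been used as a coreset heuristic in deep learning. The function name, the use of raw feature vectors, and the budget parameter are illustrative assumptions, not details taken from the survey itself.

```python
import numpy as np

def kcenter_greedy(features, budget, seed=0):
    """Illustrative coreset selection via k-center greedy:
    repeatedly add the point farthest from the current subset,
    so the selected points cover the feature space."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting point
    # distance from every point to its nearest selected point
    dists = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        idx = int(np.argmax(dists))    # farthest remaining point
        selected.append(idx)
        dists = np.minimum(
            dists, np.linalg.norm(features - features[idx], axis=1)
        )
    return selected

# Example: pick 10 representatives out of 1000 embedded samples
X = np.random.default_rng(1).normal(size=(1000, 16))
coreset = kcenter_greedy(X, budget=10)
```

In a deep learning setting, `features` would typically be embeddings from a (partially trained) network rather than raw inputs, and the selected indices would define the reduced training set.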
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Zhihui_Zhu1
Submission Number: 6416