Keywords: Coreset Selection, Deep Neural Networks, Efficient Representativeness-Aware Coreset Selection
Abstract: Dynamic coreset selection is a promising approach for improving the training efficiency of deep neural networks: it periodically selects a small subset of the most representative or informative samples, avoiding the need to train on the entire dataset. However, it remains inherently challenging, not only because of the complex interdependencies among samples and the evolving nature of model training, but also because of a critical *coreset representativeness degradation issue* that we identify and study in depth in this paper: the representativeness, or information content, of the coreset degrades as training progresses. We therefore argue that, in addition to designing accurate selection rules, it is equally important to endow selection algorithms with the ability to assess the quality of the current coreset. Such awareness enables timely re-selection and mitigates the risk of overfitting to stale subsets, a limitation often overlooked by existing methods. To this end, this paper proposes an **E**fficient **R**epresentativeness-**A**ware **C**oreset **S**election method for deep neural networks, a lightweight framework that dynamically tracks and maintains coreset quality during training. While the ideal criterion, the gradient discrepancy between the coreset and the full dataset, is computationally prohibitive, we introduce a scalable surrogate based on the signal-to-noise ratio (SNR) of gradients within the coreset; this surrogate is the main technical contribution of the paper and is supported by our theoretical analysis. Intuitively, a decline in SNR indicates overfitting to the subset and declining representativeness. Leveraging this observation, our method triggers coreset updates without costly Hessian or full-batch gradient computations, keeping the computational overhead minimal. Experiments on multiple datasets confirm the effectiveness of our approach: compared with existing gradient-based dynamic coreset selection baselines, it improves test accuracy by up to 5.4%.
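The following is a minimal, hypothetical sketch of an SNR-based re-selection trigger of the kind the abstract describes, not the paper's actual algorithm. All names (`flat_grad`, `gradient_snr`, `should_reselect`) and the particular SNR definition (squared norm of the mean minibatch gradient over the coreset as "signal", mean squared deviation of minibatch gradients from that mean as "noise") are assumptions for illustration.

```python
# Hypothetical sketch of an SNR-based coreset re-selection trigger.
# The SNR definition and all function names here are assumptions,
# not the paper's exact formulation.
import torch


def flat_grad(model, loss):
    """Gradient of `loss` w.r.t. trainable parameters, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def gradient_snr(model, loss_fn, coreset_loader, device="cpu"):
    """Estimate the gradient signal-to-noise ratio over coreset minibatches."""
    grads = []
    for x, y in coreset_loader:
        x, y = x.to(device), y.to(device)
        loss = loss_fn(model(x), y)
        grads.append(flat_grad(model, loss).detach())
    g = torch.stack(grads)                                 # (num_batches, num_params)
    signal = g.mean(dim=0).pow(2).sum()                    # ||mean gradient||^2
    noise = (g - g.mean(dim=0)).pow(2).sum(dim=1).mean()   # mean squared deviation
    return (signal / (noise + 1e-12)).item()


def should_reselect(snr_history, threshold_ratio=0.5):
    """Trigger re-selection when the latest SNR falls below a fraction of its running peak."""
    if not snr_history:
        return False
    return snr_history[-1] < threshold_ratio * max(snr_history)
```

In such a setup, training would periodically append `gradient_snr(...)` to an SNR history and invoke whatever selection rule is in use whenever `should_reselect` fires, so no Hessian or full-dataset gradient is ever computed; the actual criterion, thresholding, and theoretical guarantees are those of the paper, not of this sketch.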
Supplementary Material: zip
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 2495