TL;DR: A straightforward and effective core-set selection method that identifies valuable samples early in the training process.
Abstract: Core-set selection (CS) for deep learning has become crucial for enhancing training efficiency and understanding datasets by identifying the most informative subsets. However, most existing methods rely on heuristics or complex optimization, struggling to balance efficiency and effectiveness. To address this, we propose a novel CS objective that adaptively balances losses between core-set and non-core-set samples by minimizing the sum of squared losses across all samples. Building on this objective, we introduce the Maximum Reduction as Maximum Contribution criterion (MRMC), which identifies samples with the maximal reduction in loss as those making the maximal contribution to overall convergence. Additionally, a balance constraint is incorporated to ensure an even distribution of contributions from the core-set. Experimental results demonstrate that MRMC improves training efficiency significantly while preserving model performance with minimal cost.
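The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes per-sample losses have been recorded at two points early in training, scores each sample by how much its loss dropped, and keeps the top-k. The function names `squared_loss_objective` and `select_by_loss_reduction` and the two-checkpoint setup are hypothetical, and the balance constraint mentioned in the abstract is deliberately left out because its exact form is not given here.

```python
import numpy as np


def squared_loss_objective(losses):
    """Sum of squared per-sample losses (the quantity the abstract says is minimized)."""
    losses = np.asarray(losses, dtype=float)
    return float(np.sum(np.square(losses)))


def select_by_loss_reduction(loss_before, loss_after, k):
    """Return indices of the k samples whose loss dropped the most.

    Assumption: loss_before and loss_after are per-sample losses measured at
    two early training checkpoints. The balance constraint is not modeled.
    """
    loss_before = np.asarray(loss_before, dtype=float)
    loss_after = np.asarray(loss_after, dtype=float)
    if loss_before.shape != loss_after.shape:
        raise ValueError("loss arrays must have the same shape")
    if not 0 <= k <= loss_before.size:
        raise ValueError("k must be between 0 and the number of samples")

    # Positive values mean the sample's loss decreased between checkpoints.
    reduction = loss_before - loss_after
    # Stable sort on the negated scores gives descending order with
    # ties broken by original index.
    order = np.argsort(-reduction, kind="stable")
    return order[:k]


if __name__ == "__main__":
    before = [2.0, 1.0, 3.0, 0.5]
    after = [1.5, 0.9, 1.0, 0.5]
    core_set = select_by_loss_reduction(before, after, k=2)
    print("selected indices:", core_set.tolist())
    print("objective after:", squared_loss_objective(after))
```

Ranking by raw loss reduction is only the simplest reading of "maximal reduction as maximal contribution"; a faithful version would also need the balance constraint and the paper's exact attribution of loss reduction to samples.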
Lay Summary: In training deep learning models, selecting the most representative data subsets can significantly improve training efficiency and help us better understand the data. However, most existing methods either rely on heuristic rules or require cumbersome computations, making it difficult to achieve both efficiency and effectiveness. To address this issue, we propose a novel approach that automatically balances the importance of different data samples and selects the most valuable data subsets by calculating each sample's actual contribution to the model's training progress. Our experiments demonstrate that this method not only accelerates model training but also maintains performance comparable to using the full dataset, while saving substantial computational resources.
Link To Code: https://github.com/ssssss489/MRMC
Primary Area: General Machine Learning->Online Learning, Active Learning and Bandits
Keywords: Core-set selection, Dataset pruning, Loss reduction attribution
Submission Number: 9639