Keywords: dataset distillation, machine learning, coreset selection
Abstract: The computational and storage costs of large-scale datasets present a significant bottleneck in modern artificial intelligence (AI). While dataset distillation and coreset selection aim to mitigate this by compressing original datasets into much smaller ones, both have critical limitations. Dataset distillation produces synthetic images that exhibit architectural overfitting and poor transferability to downstream tasks. Conversely, existing coreset selection methods rely on fixed scoring functions, leading to redundant sample selection and performance saturation as the data budget increases. To address these challenges, we propose Adaptive Coreset Selection (ACS), a novel framework that learns an optimal selection strategy for a given budget. ACS employs a multi-stage approach: it first builds a foundational set of representative samples and then iteratively trains models on the selected images to identify hard samples. This adaptive process ensures that the final coreset balances representativeness and diversity. We demonstrate the efficacy of ACS on CIFAR-10 and ImageNet, where it outperforms state-of-the-art dataset distillation and coreset selection methods. Notably, on CIFAR-10 with 200 images per class, ACS surpasses all baselines by 2 percentage points in validation accuracy and shows superior generalization to downstream tasks, establishing it as a more robust and scalable solution for dataset compression.
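The two-stage process the abstract describes (a foundational representative set, then iterative rounds that add hard samples under a model trained on the current selection) can be sketched as follows. This is only an illustrative toy, not the authors' method: the nearest-centroid "model", the margin-based hardness score, and all function and parameter names (`adaptive_coreset`, `init_frac`, `n_rounds`) are assumptions introduced for the sketch.

```python
import numpy as np

def adaptive_coreset(X, y, budget, init_frac=0.5, n_rounds=5):
    """Toy two-stage adaptive selection loop (illustrative only).

    Stage 1: pick samples closest to their class centroid (representative).
    Stage 2: iteratively score remaining samples with a nearest-centroid
    classifier fit on the current coreset and add the hardest (lowest-margin)
    ones, so the final set mixes representative and difficult examples.
    """
    classes = np.unique(y)
    selected = []

    # Stage 1: representative samples nearest to each class centroid.
    per_class_init = max(1, int(budget * init_frac) // len(classes))
    for c in classes:
        idx = np.flatnonzero(y == c)
        centroid = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - centroid, axis=1)
        selected.extend(idx[np.argsort(dist)[:per_class_init]])

    # Stage 2: iteratively add hard samples, re-fitting the centroid
    # classifier on the current coreset each round.
    per_round = max(1, (budget - len(selected)) // n_rounds)
    for _ in range(n_rounds):
        if len(selected) >= budget:
            break
        sel = np.array(selected)
        centroids = np.stack([X[sel[y[sel] == c]].mean(axis=0) for c in classes])
        remaining = np.setdiff1d(np.arange(len(X)), sel)
        # Margin: distance to nearest other-class centroid minus distance
        # to the true-class centroid; a small margin means a hard sample.
        d = np.linalg.norm(X[remaining, None, :] - centroids[None, :, :], axis=2)
        true_col = np.searchsorted(classes, y[remaining])
        rows = np.arange(len(remaining))
        true_d = d[rows, true_col]
        d_other = d.copy()
        d_other[rows, true_col] = np.inf
        margin = d_other.min(axis=1) - true_d
        take = min(per_round, budget - len(selected))
        selected.extend(remaining[np.argsort(margin)[:take]])
    return np.array(selected[:budget])
```

In practice the inner classifier would be a network trained on the selected images and the hardness score would come from its losses, but the control flow (representative seeding, then score-and-add rounds under a refreshed model) is the same.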
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 11047