Rethinking Dataset Pruning From A Generalization Perspective

Furui Xu; Shaobo Wang; Luo Zhongwei; Linfeng Zhang

Published: 05 Mar 2025, Last Modified: 19 Apr 2025, Venue: MLDPR 2025, License: CC BY 4.0

Keywords: Dataset pruning, Coreset selection

TL;DR: The paper introduces UNSEEN, a plug-and-play framework for dataset pruning that focuses on generalization rather than training performance.

Abstract: The growing scale of datasets in deep learning has introduced significant computational challenges. To address this problem, dataset pruning aims to construct an informative coreset from the full dataset such that models trained on the coreset achieve performance comparable to training on the full dataset. Previous dataset pruning methods mostly score samples by how the model behaves on them during the training (i.e., fitting) phase. In this paper, we rethink dataset pruning from the perspective of generalization, i.e., we score samples using models that have not been trained on them. We propose UNSEEN, a plug-and-play framework that can be integrated into existing dataset pruning methods. For instance, the simplest Entropy method achieves accuracy comparable to state-of-the-art (SOTA) methods under our framework. We validate our method on various datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, to demonstrate its effectiveness.
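The core idea in the abstract, scoring each sample with a model that has not been trained on it, can be illustrated with a small cross-fold sketch. This is not the authors' implementation: the abstract does not specify how UNSEEN partitions the data or which models it uses, so the fold-based split, the tiny softmax-regression classifier, the choice to keep the highest-entropy samples, and every function name below (`held_out_entropy_scores`, `prune`, and so on) are assumptions made for illustration only.

```python
import numpy as np

def predict_proba(X, W, b):
    """Softmax class probabilities for a linear model."""
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, num_classes, lr=0.5, epochs=200):
    """Fit softmax regression with full-batch gradient descent (stand-in model)."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot labels
    for _ in range(epochs):
        grad = (predict_proba(X, W, b) - Y) / n
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def entropy(P, eps=1e-12):
    """Predictive entropy per row of a probability matrix."""
    return -(P * np.log(P + eps)).sum(axis=1)

def held_out_entropy_scores(X, y, num_classes, num_folds=4, seed=0):
    """Score every sample with a model trained only on the other folds.

    Assumption: a K-fold partition is one way to guarantee that the scoring
    model has never seen the sample; the paper may use a different scheme.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), num_folds)
    scores = np.empty(len(X))
    for k, held_out in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        W, b = train_softmax(X[train_idx], y[train_idx], num_classes)
        scores[held_out] = entropy(predict_proba(X[held_out], W, b))
    return scores

def prune(scores, keep_ratio):
    """Return sorted indices of the retained samples.

    Assumption: the highest-entropy (most uncertain) samples are kept.
    """
    keep = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(-scores)[:keep])
```

A call such as `prune(held_out_entropy_scores(X, y, num_classes=10), keep_ratio=0.3)` would return the indices of a 30% coreset under these assumptions. Swapping `entropy` for another per-sample score is the sense in which such a scheme could be "plug-and-play"; a conventional training-phase variant would instead score each sample with a model fitted on the full dataset, including that sample.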

Submission Number: 4
