Dataset Pruning: Optimizing Image Datasets with a Cross-Validation Method
Abstract: In image recognition, the scale and diversity of a dataset are crucial for model training. This study proposes a cross-validation dataset pruning method with data balancing (CVDP-DB). The method evaluates whether each training sample is predicted correctly and with what probability, assigns each sample a score, and prunes the dataset according to that score. It removes hard samples that are predicted incorrectly and easy samples that are predicted correctly with high probability, while retaining samples with high prediction uncertainty. The retained samples form a refined coreset for model training. This approach improves the feature distribution of the dataset and enhances the model’s ability to recognize key samples. Experimental results show that CVDP-DB achieves strong classification performance across a range of models and datasets. Notably, CVDP-DB is not tied to a specific model or dataset, and in the reported experiments it outperforms state-of-the-art (SOTA) methods.
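The pruning rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring function `uncertainty_score`, the helper `prune_balanced`, the `keep_fraction` parameter, and the toy probabilities are all hypothetical, and the out-of-fold class probabilities are assumed to have been produced beforehand by k-fold cross-validation. The sketch scores each sample by how close its true-class probability is to 0.5, so that confidently wrong (hard) and confidently right (easy) samples both score low, and then keeps the same fraction of top-scoring samples in every class as a simple form of data balancing.

```python
import math
from collections import defaultdict


def uncertainty_score(prob_row, label):
    """Hypothetical score: highest when the true-class probability is near 0.5."""
    p_true = prob_row[label]
    return 1.0 - abs(p_true - 0.5) * 2.0


def prune_balanced(probs, labels, keep_fraction):
    """Return sorted indices of retained samples, keeping the same fraction per class."""
    by_class = defaultdict(list)
    for idx, (row, label) in enumerate(zip(probs, labels)):
        by_class[label].append((uncertainty_score(row, label), idx))

    kept = []
    for scored in by_class.values():
        # Keep at least one sample per class so no class disappears.
        n_keep = max(1, math.ceil(keep_fraction * len(scored)))
        # Highest score first; ties broken by lower index for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        kept.extend(idx for _, idx in scored[:n_keep])
    return sorted(kept)


# Toy out-of-fold probabilities for a two-class problem (made-up numbers).
probs = [
    [0.98, 0.02],  # label 0, easy: correct with high probability
    [0.55, 0.45],  # label 0, uncertain
    [0.10, 0.90],  # label 0, hard: predicted incorrectly
    [0.40, 0.60],  # label 1, uncertain
    [0.01, 0.99],  # label 1, easy
    [0.95, 0.05],  # label 1, hard
]
labels = [0, 0, 0, 1, 1, 1]

kept = prune_balanced(probs, labels, keep_fraction=0.25)
print(kept)  # [1, 3]
```

With these toy inputs only the two uncertain samples survive, one per class. The symmetric score is a simplification chosen for the sketch; the scoring and balancing used by CVDP-DB may differ.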