TL;DR: For dataset pruning, we introduce the DUAL (Difficulty and Uncertainty-Aware Lightweight) score, a new method that identifies important examples early in training by considering both example difficulty and prediction uncertainty.
Abstract: Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model on the full dataset for a large number of epochs before they can prune it, which, ironically, makes the pruning process more expensive than simply training the model on the entire dataset. To overcome this limitation, we introduce the **Difficulty and Uncertainty-Aware Lightweight (DUAL)** score, which identifies important samples from the early stage of training by considering both example difficulty and prediction uncertainty. To address the catastrophic accuracy drop at extreme pruning ratios, we further propose pruning-ratio-adaptive sampling based on the Beta distribution.
Experiments on various datasets and learning scenarios, including image classification under label noise and image corruption as well as generalization across model architectures, demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost of pruning to 66% of that of previous methods while achieving SOTA test accuracy of 60% at a 90% pruning ratio. On the CIFAR datasets, the time cost is reduced to just 15% while maintaining SOTA performance.
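For intuition, here is a minimal sketch of what a DUAL-style score could look like. This is an illustration, not the paper's exact formula: it assumes difficulty is measured by a consistently low true-class probability and uncertainty by the fluctuation of that probability across early training epochs, combined as a product. The function name `dual_score` and this particular combination are assumptions made for the example.

```python
import numpy as np

def dual_score(prob_history: np.ndarray) -> np.ndarray:
    """Illustrative DUAL-style score (not the paper's exact definition).

    prob_history: array of shape (num_epochs, num_examples) holding the
    model's predicted probability of the true class for each example,
    recorded over the early training epochs.
    """
    # Difficulty: examples whose true-class probability stays low are hard.
    difficulty = 1.0 - prob_history.mean(axis=0)
    # Uncertainty: how much the prediction fluctuates across epochs.
    uncertainty = prob_history.std(axis=0)
    # Combine both signals; a higher score marks a more informative example.
    return difficulty * uncertainty
```

Under this sketch, pruning would keep the top-scoring fraction of examples after only a few recorded epochs, which is what makes the score lightweight compared with full-training-based scores.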
Lay Summary: Training modern AI models requires massive datasets and days of computation. Data pruning—removing less important data—can cut training costs while preserving performance. However, most recent pruning methods overemphasize accuracy, often consuming more resources than training on the full dataset.
We introduce a faster, more cost-effective pruning technique called the "DUAL score", which evaluates both the difficulty and the uncertainty of each data point early in training. By combining these signals, DUAL score identifies less informative examples with far less overhead. Data is then pruned adaptively based on a prespecified pruning ratio: the higher the ratio, the more likely easier data points are to be selected for training.
Our method revives the original goal of data pruning—to reduce training cost—without sacrificing accuracy. Experiments on CIFAR and ImageNet-1k demonstrate that DUAL score achieves state-of-the-art accuracy even when pruning 90% of the data while reducing training time by 33% on ImageNet-1k and 85% on CIFAR.
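As a companion illustration, below is a hedged sketch of pruning-ratio-adaptive sampling. The helper name `beta_adaptive_select` and the specific Beta parameterization (`a`, `b`) are hypothetical choices for this example, not the paper's specification; the sketch only captures the stated idea that a higher pruning ratio shifts selection toward easier examples.

```python
import numpy as np
from scipy.stats import beta

def beta_adaptive_select(scores: np.ndarray, prune_ratio: float,
                         rng: np.random.Generator | None = None) -> np.ndarray:
    """Illustrative pruning-ratio-adaptive sampling via a Beta distribution.

    scores: per-example DUAL-style scores (higher = harder / more informative).
    prune_ratio: fraction of the dataset to discard, in [0, 1).
    Returns the indices of the examples kept for training.
    """
    rng = rng or np.random.default_rng(0)
    n = len(scores)
    n_keep = int(round(n * (1.0 - prune_ratio)))
    # Normalized difficulty rank in [0, 1]: 0 = easiest, 1 = hardest.
    ranks = scores.argsort().argsort() / (n - 1)
    # Hypothetical parameterization: as prune_ratio grows, the Beta density
    # concentrates near rank 0, favoring easier examples and avoiding the
    # accuracy collapse of keeping only the hardest ones.
    a = 1.0 + (1.0 - prune_ratio) * 4.0
    b = 1.0 + prune_ratio * 4.0
    weights = beta.pdf(ranks, a, b) + 1e-12  # small floor keeps all weights valid
    return rng.choice(n, size=n_keep, replace=False, p=weights / weights.sum())
```

For instance, at a 90% pruning ratio (`prune_ratio=0.9`) the sampling weights concentrate on low-rank (easy) examples, whereas at low ratios they lean toward hard, informative ones.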
Link To Code: https://github.com/behaapyy/dual-pruning
Primary Area: General Machine Learning->Supervised Learning
Keywords: Dataset Pruning, Coreset Selection, Example Difficulty, Prediction Uncertainty
Submission Number: 15764