Distilling the Knowledge in Data Pruning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: With the increasing size of datasets used for training neural networks, data pruning has gained traction in recent years. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper, we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvements across datasets, pruning methods, and pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: with KD, simple random pruning is comparable to, or better than, sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.
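
For illustration, the following is a minimal sketch (not the authors' released code) of how training on a pruned subset with knowledge distillation can be set up: the loss mixes the ground-truth cross-entropy with a KL-divergence term toward a teacher pre-trained on the full data. The helper names (`kd_loss`, `train_on_pruned_subset`) and the hyperparameters `alpha` (KD weight) and `T` (temperature) are illustrative assumptions; the abstract notes that the optimal KD weight depends on the pruning fraction.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.9, T=4.0):
    """Weighted sum of hard-label cross-entropy and soft-label distillation loss."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature scaling of the distillation term
    return alpha * kl + (1.0 - alpha) * ce

def train_on_pruned_subset(student, teacher, pruned_loader, optimizer,
                           alpha=0.9, T=4.0, device="cuda"):
    """One epoch over a (e.g., randomly) pruned subset, distilling from the teacher."""
    teacher.eval()
    student.train()
    for images, labels in pruned_loader:
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():
            teacher_logits = teacher(images)  # teacher was trained on the full data
        student_logits = student(images)
        loss = kd_loss(student_logits, teacher_logits, labels, alpha=alpha, T=T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In this sketch, `pruned_loader` would iterate only over the retained subset (random or score-based), and `alpha` would be tuned per pruning fraction, in line with the connection between pruning factor and KD weight described above.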
Lay Summary: Modern artificial intelligence models are trained on massive datasets, but not all of that data is equally important. Researchers have been exploring ways to shrink these datasets — a process called data pruning — to make training faster and cheaper. The challenge is doing so without hurting the model’s performance. Our work shows that combining pruning with a technique called knowledge distillation can solve this problem. Knowledge distillation means that, instead of just learning from the original labels, the model also learns from the predictions made by a more experienced "teacher" model trained on the full dataset. This extra guidance helps the new model stay accurate even when trained on far less data. Surprisingly, we found that even randomly selected data can work as well as — or better than — carefully chosen examples, as long as knowledge distillation is used. We also discovered how to adjust this technique depending on how much data is removed, and even found cases where using a smaller teacher model worked better than a larger one. This makes AI training more efficient and practical for real-world use.
Primary Area: Deep Learning->Algorithms
Keywords: Data pruning, knowledge distillation
Submission Number: 9059