Keywords: Dataset Distillation
Abstract: Dataset distillation aims to synthesize a small set of informative samples that preserve the generalization ability of large datasets, yet its behavior under noisy conditions remains underexplored. In this paper, we systematically study dataset distillation under three representative noise types: symmetric, asymmetric, and natural noise. We first find that, under symmetric noise, once the noise ratio exceeds a critical threshold, mainstream distillation methods consistently outperform training on the full noisy dataset while using significantly fewer samples. In contrast, under asymmetric noise, the structured label corruption often entangles with semantic features, making it difficult for distilled samples to recover the clean data distribution. We further validate the effectiveness of dataset distillation on real-world noisy datasets, highlighting its robustness under high noise but its degraded performance in low-noise settings due to over-compression. To provide theoretical insight, we derive upper and lower bounds on the required number of images per class (IPC) under each noise type, grounded in information theory and PAC-Bayes analysis. Our findings offer both empirical and theoretical guidelines for effective distillation in noisy learning scenarios.
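To make the two synthetic noise models contrasted in the abstract concrete, the following is a minimal illustrative sketch (not from the submission): symmetric noise flips a label uniformly to any other class, while asymmetric noise flips it to a fixed "confusable" class. The flip probability `noise_ratio`, the next-class pairwise mapping, and the NumPy-based implementation are assumptions for illustration only.

```python
import numpy as np

def inject_symmetric_noise(labels, num_classes, noise_ratio, rng=None):
    """Flip each label, with probability `noise_ratio`, to a uniformly random other class."""
    if rng is None:
        rng = np.random.default_rng(0)
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_ratio
    # Offsets in [1, num_classes-1] guarantee the corrupted label differs from the true one.
    offsets = rng.integers(1, num_classes, size=flip.sum())
    labels[flip] = (labels[flip] + offsets) % num_classes
    return labels

def inject_asymmetric_noise(labels, num_classes, noise_ratio, rng=None):
    """Flip each label, with probability `noise_ratio`, to a fixed confusable class
    (here the next class index, a common stand-in for semantically similar pairs)."""
    if rng is None:
        rng = np.random.default_rng(0)
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_ratio
    labels[flip] = (labels[flip] + 1) % num_classes
    return labels

if __name__ == "__main__":
    y = np.repeat(np.arange(10), 100)            # 10 classes, 100 samples each
    y_sym = inject_symmetric_noise(y, 10, 0.4)
    y_asym = inject_asymmetric_noise(y, 10, 0.4)
    print("symmetric noise rate :", (y != y_sym).mean())
    print("asymmetric noise rate:", (y != y_asym).mean())
```

Because asymmetric flips follow class structure rather than acting uniformly, they entangle with semantic features, which is the intuition behind the degraded distillation behavior the abstract reports under asymmetric noise.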
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6480