Keywords: data pruning; coreset selection; noisy label learning; data-centric AI
TL;DR: In this paper, we introduce a novel robust data pruning method tailored to align with real-world benchmarks, specifically addressing datasets that contain noisy labels.
Abstract: Data pruning aims to prune large-scale datasets into concise subsets, thereby reducing computational costs during model training.
While a variety of data pruning methods have been proposed, most focus on meticulously curated datasets, and relatively few studies address real-world datasets containing noisy labels. In this paper, we empirically analyze the shortcomings of previous gradient-based methods, revealing that geometry-based methods exhibit greater resilience to noisy labels. Consequently, we propose a novel two-stage noisy data pruning method that incorporates selection and re-labeling processes and takes geometric neighborhood information into account. Specifically, we use the distribution divergence between a given label and the predictions of its neighboring samples as an importance metric for data pruning. To ensure reliable neighboring predictions, we employ feature propagation and label propagation to refine these predictions. Furthermore, we apply re-labeling methods to correct selected subsets and consider the coverage of both easy and hard samples at different pruning rates. Extensive experiments demonstrate the effectiveness of the proposed method, not only on real-world benchmarks but also on synthetic datasets, highlighting its suitability for practical applications involving noisy labels.
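The neighbor-divergence idea from the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): it scores a sample by the KL divergence between its given one-hot label and the mean predicted class distribution of its nearest neighbors, so a sample whose label disagrees with its neighborhood receives a high score. The function name and the choice of KL divergence are assumptions for illustration.

```python
import numpy as np

def neighbor_divergence_score(label, neighbor_probs, eps=1e-12):
    """Score a sample by KL(label || mean neighbor prediction).

    label          -- one-hot (or soft) label vector for the sample.
    neighbor_probs -- (k, num_classes) array of predicted class
                      distributions for the sample's k nearest neighbors.
    A high score suggests the given label disagrees with its geometric
    neighborhood, flagging a likely noisy label.
    """
    # Average the neighbors' predicted class distributions.
    q = np.mean(np.asarray(neighbor_probs, dtype=float), axis=0)
    p = np.asarray(label, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # KL(p || q); eps avoids log(0) for zero-probability classes.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Example: a sample labeled class 0 whose three neighbors mostly
# predict class 1 scores much higher than one whose neighbors agree.
noisy = neighbor_divergence_score(
    [1.0, 0.0, 0.0],
    [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3]])
clean = neighbor_divergence_score(
    [1.0, 0.0, 0.0],
    [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05], [0.9, 0.05, 0.05]])
```

In a pruning pipeline, samples with the highest divergence would be candidates for removal or re-labeling; the abstract's feature- and label-propagation steps would refine `neighbor_probs` before scoring.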
Supplementary Material: pdf
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1802