Geometric Median (GM) Matching for Robust Data Pruning

Anish Acharya; Inderjit S Dhillon; Sujay Sanghavi

17 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | Readers: Everyone | License: CC BY 4.0

Keywords: data pruning, robust, data selection

TL;DR: We propose Geometric Median (GM) Matching, a novel data pruning strategy that remains robust even when up to a 1/2 fraction of the data is arbitrarily corrupted.

Abstract: Data pruning, the combinatorial task of selecting a small and informative subset from a large dataset, is crucial for mitigating the enormous computational cost of training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust in the presence of corruption is critical in practice. In response, we propose GM Matching, a herding-style (Welling, 2009) greedy algorithm that yields a $k$-subset whose mean approximates the geometric median of the (potentially) noisy dataset. Theoretically, we show that GM Matching enjoys an improved $\mathcal{O}(1/k)$ scaling over the $\mathcal{O}(1/\sqrt{k})$ scaling of uniform sampling, while achieving the optimal breakdown point of 1/2 even under arbitrary corruption. Extensive experiments across popular deep learning benchmarks indicate that GM Matching consistently outperforms prior state-of-the-art methods; the gains become more pronounced at high corruption rates and aggressive pruning rates, making it a strong baseline for robust data pruning.
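
The abstract does not spell out the selection rule, so the following is only a minimal sketch of the idea under stated assumptions: the geometric median is approximated with Weiszfeld iterations, and the subset is built with a linear-kernel herding update whose target is that median instead of the mean. The function names (`centroid`, `geometric_median`, `gm_matching`), the raw input space, and the toy data are illustrative assumptions and are not taken from the authors' code.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld iterations: approximate argmin_z sum_i ||z - x_i||."""
    z = centroid(points)  # start from the (non-robust) mean
    for _ in range(iters):
        # Inverse-distance weights; eps guards against division by zero.
        weights = [1.0 / max(math.dist(z, p), eps) for p in points]
        total = sum(weights)
        new_z = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(z))
        ]
        shift = math.dist(z, new_z)
        z = new_z
        if shift < eps:
            break
    return z


def gm_matching(points, k):
    """Herding-style greedy pick of k distinct indices (assumed update rule).

    At each step, pick the unused point with the largest inner product
    with theta, then move theta by (target - picked point).
    """
    target = geometric_median(points)
    theta = list(target)
    available = list(range(len(points)))
    chosen = []
    for _ in range(k):
        best = max(
            available,
            key=lambda i: sum(t * x for t, x in zip(theta, points[i])),
        )
        chosen.append(best)
        available.remove(best)
        theta = [t + g - x for t, g, x in zip(theta, target, points[best])]
    return chosen


# Toy data (an assumption for illustration): 70 clean points, 30 corrupted.
random.seed(0)
clean = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(70)]
corrupt = [[10 + random.gauss(0, 0.1), 10 + random.gauss(0, 0.1)] for _ in range(30)]
data = clean + corrupt

subset = [data[i] for i in gm_matching(data, 25)]
print("subset mean to GM:  ", math.dist(centroid(subset), geometric_median(data)))
print("dataset mean to GM: ", math.dist(centroid(data), geometric_median(data)))
```

In this sketch the matching is done on raw coordinates, so the subset mean tracks the geometric median but the subset itself may still contain some corrupted points; the choice of feature space and any further details would need to follow the paper.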

Supplementary Material: zip

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 1402
