TL;DR: This paper proposes Geometric Median Matching -- a robust data pruning method that selects representative subsets by matching their mean to the geometric median, ensuring resilience to extreme noise and corruption.
Abstract: Data pruning -- the combinatorial task of selecting a small and representative subset from a large dataset -- is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. Existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, this work proposes Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median (GM), a robust estimator with an optimal breakdown point of 1/2, to enhance resilience against noisy data. Our method iteratively selects a $k$-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved $\mathcal{O}(1/k)$ convergence rate, outperforming the $\mathcal{O}(1/\sqrt{k})$ scaling of uniform sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings, making it a strong baseline for robust data pruning.
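The two ingredients described above -- a robust center estimate and a greedy subset whose mean tracks it -- can be sketched as follows. This is a minimal illustration, not the authors' released implementation: it assumes Weiszfeld's classical iteration for the geometric median and a simple greedy rule that, at each step, adds the point bringing the subset mean closest to the GM.

```python
import numpy as np

def geometric_median(X, max_iter=100, tol=1e-7):
    """Weiszfeld's algorithm: iteratively reweighted mean of the rows of X."""
    mu = X.mean(axis=0)
    for _ in range(max_iter):
        d = np.linalg.norm(X - mu, axis=1)
        d = np.maximum(d, tol)          # guard against division by zero
        w = 1.0 / d
        new_mu = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu

def gm_matching(X, k):
    """Greedily pick k indices so the running subset mean tracks the GM.

    Illustrative greedy rule (an assumption of this sketch): at step t,
    add the unselected point that minimizes || mean(S ∪ {x}) - GM ||.
    """
    gm = geometric_median(X)
    selected = []
    running_sum = np.zeros(X.shape[1])
    for t in range(1, k + 1):
        # subset mean that would result from adding each candidate point
        cand_means = (running_sum + X) / t
        errs = np.linalg.norm(cand_means - gm, axis=1)
        errs[selected] = np.inf         # sample without replacement
        i = int(np.argmin(errs))
        selected.append(i)
        running_sum += X[i]
    return selected
```

Because the GM has a 1/2 breakdown point, the target center stays near the clean data even when a large fraction of points is grossly corrupted, so the greedy selection is steered away from outliers that would dominate an empirical-mean target.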
Lay Summary: Today’s AI systems are trained on enormous datasets — millions of images, texts, or videos — but not all that data is helpful. In fact, a lot of it is messy: mislabeled, corrupted, or just plain misleading. Training on this kind of data not only wastes time and energy, it can actively hurt performance.
So, what if we could pick only the right data -- the most reliable, representative examples -- and throw out the rest?
That’s the idea behind this paper. We introduce Geometric Median Matching, a new way to carefully prune down large datasets to their cleanest, most informative core. Unlike existing methods that rely on averages — which can be easily thrown off by just a few bad examples — we use a more robust tool called the geometric median. It finds the “true center” of the data in a way that naturally resists noise and outliers. We then select a small subset of the data that best matches this stable center. The result is a dramatically smaller training set that still teaches the model everything it needs to know — and often does it better. Even if up to half the data is corrupted, our method still works. Across tasks like image recognition and image generation, this approach trains models faster, uses less compute, and produces better results.
In short: we show how AI can learn more by training on less — if you choose the right data.
Link To Code: https://github.com/anishacharya/GM-Matching
Primary Area: Deep Learning->Robustness
Keywords: robust, subset, sampling, efficiency, coreset, combinatorial, noise, real world, corruption
Submission Number: 7988