Robust Sparsification via Sensitivity

Chansophea Wathanak In; Yi Li; David Woodruff; Xuan Wu

Robust Sparsification via Sensitivity

Chansophea Wathanak In, Yi Li, David Woodruff, Xuan Wu

Published: 01 May 2025, Last Modified: 18 Jun 2025ICML 2025 posterEveryoneRevisionsBibTeXCC BY 4.0

TL;DR: We construct robust coreset for any optimization problem that admits bounded sensitivity and vanilla coreset.

Abstract: Robustness to outliers is important in machine learning. Many classical problems, including subspace embedding, clustering, and low-rank approximation, lack scalable, outlier-resilient algorithms. This paper considers machine learning problems of the form $\min_{x\in \mathbb{R}^d} F(x)$, where $F(x)=\sum_{i=1}^n F_i(x)$, and their robust counterparts $\min_{x\in\mathbb{R}^d} F^{(m)}(x)$, where $F^{(m)}(x)$ denotes the sum of all but the $m$ largest $F_i(x)$ values. We develop a general framework for constructing $\epsilon$-coresets for such robust problems, where an $\epsilon$-coreset is a weighted subset of functions $\{F_1(x),\dots,F_n(x)\}$ that provides a $(1+\epsilon)$-approximation to $F(x)$. Specifically, if the original problem $F$ has total sensitivity $T$ and admits a vanilla $\epsilon$-coreset of size $S$, our algorithm constructs an $\epsilon$-coreset of size $\tilde{O}(\frac{mT}{\epsilon})+S$ for the robust objective $F^{(m)}$. This coreset size can be shown to be near-tight for $\ell_2$ subspace embedding. Our coreset algorithm has scalable running time and leads to new or improved algorithms for the robust optimization problems. Empirical evaluations demonstrate that our coresets outperform uniform sampling on real-world data sets.

Lay Summary: The existence of outliers presents formidable challenges for many machine learning problems, including clustering, subspace embedding, and low-rank approximation. This paper considers robust coresets for all the above machine learning problems, which reduce the input dataset into a small subset, thus providing a scalable and robust solution. We design a general framework to construct a robust coreset for any machine learning problem that admits a bounded total sensitivity and vanilla coreset. Our coreset is near-optimal for subspace embedding. Our experimental results show that our coresets outperform the uniform sampling benchmark on real-world data sets.

Primary Area: Theory->Optimization

Keywords: Coreset, Robust Machine Learning, Subspace Embedding, Clustering

Submission Number: 1617

Loading