TL;DR: We present the first polynomial-time algorithm that achieves optimal error for high-dimensional mean estimation in the mean-shift contamination model.
Abstract: We study the algorithmic problem of robust mean estimation of an identity-covariance Gaussian in the presence of mean-shift contamination. In this contamination model, we are given a set of points in $\mathbb{R}^d$ generated i.i.d. via the following process. For a parameter $\alpha<1/2$, the $i$-th sample $x_i$ is obtained as follows: with probability $1-\alpha$, $x_i$ is drawn from $\mathcal{N}(\mu, I)$, where $\mu \in \mathbb{R}^d$ is the target mean; and with probability $\alpha$, $x_i$ is drawn from $\mathcal{N}(z_i, I)$, where $z_i$ is unknown and potentially arbitrary. Prior work characterized the information-theoretic limits of this task. Specifically, it was shown that, in contrast to Huber contamination, consistent estimation is possible in the presence of mean-shift contamination. On the other hand, all known robust estimators in the mean-shift model have running times exponential in the dimension. Here we give the first computationally efficient algorithm for high-dimensional robust mean estimation with mean-shift contamination that can tolerate a constant fraction of outliers. In particular, our algorithm has near-optimal sample complexity, runs in sample-polynomial time, and approximates the target mean to any desired accuracy. Conceptually, our result contributes to a growing body of work that studies inference with respect to natural noise models lying in between fully adversarial and random settings.
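The sampling process described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm; it only generates data from the mean-shift contamination model and shows that the plain empirical mean is biased by the shifted points. The names `sample_mean_shift`, `empirical_mean`, and `shift_fn`, as well as the particular choice of shifted centers, are assumptions made for illustration.

```python
import random


def sample_mean_shift(n, mu, alpha, shift_fn, seed=0):
    """Draw n points from the mean-shift contamination model.

    With probability 1 - alpha a point is drawn from N(mu, I); with
    probability alpha it is drawn from N(z_i, I), where z_i = shift_fn(i, d)
    is a hypothetical stand-in for the unknown, arbitrary center.
    """
    rng = random.Random(seed)
    d = len(mu)
    points, is_outlier = [], []
    for i in range(n):
        corrupted = rng.random() < alpha
        center = shift_fn(i, d) if corrupted else mu
        # Identity covariance: independent unit-variance noise per coordinate.
        points.append([c + rng.gauss(0.0, 1.0) for c in center])
        is_outlier.append(corrupted)
    return points, is_outlier


def empirical_mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n, d = len(points), len(points[0])
    return [sum(p[j] for p in points) / n for j in range(d)]


# Assumed setup: true mean at the origin, 20% of centers shifted to (10, 10, 10).
mu = [0.0, 0.0, 0.0]
points, is_outlier = sample_mean_shift(
    2000, mu, alpha=0.2, shift_fn=lambda i, d: [10.0] * d, seed=1
)

naive = empirical_mean(points)
inliers_only = empirical_mean([p for p, o in zip(points, is_outlier) if not o])
print("naive mean:       ", naive)
print("inlier-only mean: ", inliers_only)
```

In this setup the naive mean is pulled toward the shifted centers by roughly $\alpha$ times the shift, while the inlier-only mean (which uses labels an estimator would not have) stays close to $\mu$. A robust estimator for this model has to close that gap without access to the labels.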
Lay Summary: Computing the mean over a collection of samples is a fundamental task that underlies many algorithms in both theory and practice. However, samples are often corrupted, and a substantial body of work focuses on cases where those corruptions are completely arbitrary. In such settings, estimating an accurate mean is considerably more difficult; in fact, perfect recovery of the true mean is impossible.
In our work, we assume a more structured corruption model: each corrupted sample contains some added noise rather than an arbitrary outlier. Under this assumption, we design an efficient procedure that estimates the mean with near-perfect accuracy.
We believe this approach sheds light on the challenge of computing the mean in the presence of structured (rather than arbitrary) corruption, a problem of broad practical importance.
Primary Area: Theory->Learning Theory
Keywords: mean estimation, high-dimensional inference, robust statistics, contamination, computational efficiency
Submission Number: 12278