All-Purpose Mean Estimation over $\mathbb{R}$: Optimal Sub-Gaussianity with Outlier Robustness and Low Moments Performance

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 oral, CC BY 4.0
Abstract: We consider the basic statistical challenge of designing an "all-purpose" mean estimation algorithm that is recommendable across a variety of settings and models. Recent work by [Lee and Valiant 2022] introduced the first 1-d mean estimator whose error in the standard finite-variance i.i.d. setting is optimal even in its constant factors; its strong empirical performance was demonstrated by [Gobet et al. 2022]. Yet, unlike classic (but not necessarily practical) estimators such as median-of-means and the trimmed mean, this new algorithm lacked proven robustness guarantees in other settings, including adversarial data corruption and heavy-tailed distributions with infinite variance. Such robustness is important for practical use. This raises a research question: is it possible to have a mean estimator that is robust, *without* sacrificing provably optimal performance in the standard i.i.d. setting? In this work, we show that Lee and Valiant's estimator is in fact an "all-purpose" mean estimator by proving: (A) It is robust to an $\eta$-fraction of data corruption, even in the strong contamination model; it has optimal estimation error $O(\sigma\sqrt{\eta})$ for distributions with variance $\sigma^2$. (B) For distributions with a finite $z^\text{th}$ moment, for $z \in (1,2)$, it has optimal estimation error, matching the lower bounds of [Devroye et al. 2016] up to constants. We further show (C) that outlier robustness for 1-d mean estimators in fact implies neighborhood optimality, a notion of beyond-worst-case, distribution-dependent optimality recently introduced by [Dang et al. 2023]. Previously, such an optimality guarantee was known only for median-of-means, but it now holds for all estimators that are simultaneously *robust* and *sub-Gaussian*, including Lee and Valiant's, resolving a question raised by Dang et al. Lastly, we show (D) the asymptotic normality and efficiency of Lee and Valiant's estimator, as further evidence of its performance across many settings.
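For readers who want a concrete reference point, below is a minimal Python sketch of the two classic baselines the abstract names, median-of-means and the trimmed mean, plus a tiny simulation in the infinite-variance regime ($z \in (1,2)$) that the paper studies. The group count $k \approx \lceil\log(1/\delta)\rceil$, the trimming fraction, and all simulation constants are illustrative textbook conventions, not the paper's exact parameterizations, and the Lee and Valiant estimator itself is not reproduced here.

```python
import numpy as np

def median_of_means(x, delta=0.01, rng=None):
    # Classic baseline: split the sample into k ~ log(1/delta) groups,
    # average each group, and return the median of the group means.
    # The group count is a common convention, not the paper's tuning.
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    k = min(len(x), max(1, int(np.ceil(np.log(1.0 / delta)))))
    groups = np.array_split(x[rng.permutation(len(x))], k)
    return float(np.median([g.mean() for g in groups]))

def trimmed_mean(x, eta=0.05):
    # The other classic baseline: symmetrically discard an eta-fraction
    # of the smallest and largest points, then average the rest.
    x = np.sort(np.asarray(x, dtype=float))
    m = int(np.floor(eta * len(x)))
    return float(x[m:len(x) - m].mean()) if len(x) > 2 * m else float(np.median(x))

if __name__ == "__main__":
    # Illustrative heavy-tail experiment: Pareto data with tail index
    # a = 1.5 has finite mean 1/(a-1) = 2 but infinite variance, i.e.
    # a finite z-th moment only for z < 1.5, as in claim (B).
    rng = np.random.default_rng(0)
    a, n, trials = 1.5, 1000, 500
    true_mean = 1.0 / (a - 1.0)
    err_avg, err_mom = [], []
    for _ in range(trials):
        x = rng.pareto(a, size=n)
        err_avg.append(abs(x.mean() - true_mean))
        err_mom.append(abs(median_of_means(x, rng=rng) - true_mean))
    # The robustness gap typically shows up in the tail of the error
    # distribution (the "unlucky draws"), so compare upper quantiles.
    print(f"sample mean     95th-pct error: {np.quantile(err_avg, 0.95):.3f}")
    print(f"median-of-means 95th-pct error: {np.quantile(err_mom, 0.95):.3f}")
```

The abstract's point is that, unlike these baselines, the Lee and Valiant estimator attains this kind of robustness while also being optimal, including in its constant factors, in the standard finite-variance i.i.d. setting.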
Lay Summary: Suppose we have a large population of numbers (say, the individual income of people in a country), and we're trying to estimate the population mean via sampling. The conventional method is to take a bunch of samples and just compute their average, in the hope that it is a reasonable extrapolation. However, the sample average is very sensitive to extreme values, which might occur in our data set if we get unlucky in our sampling. Moreover, real-world data sampling can introduce errors, for example through mistakes in data entry or even through malicious meddling by bad actors. This paper mathematically proves that the recent Lee and Valiant mean estimator achieves essentially the smallest possible error in a wide variety of settings, including in badly behaved populations where extreme values are relatively common, and also in settings where there is data corruption.
Primary Area: Theory->Learning Theory
Keywords: mean estimation, instance optimality, robust statistics
Submission Number: 13863