Abstract: Robust statistics aims to compute summary quantities for data in which a fraction of the points may be arbitrarily corrupted. The most essential such statistic is the mean, and in recent years there has been a flurry of theoretical advances in efficiently estimating the mean of high-dimensional corrupted data. While several algorithms have been proposed that achieve near-optimal error, they all require a data size that grows with the dimension. In this paper, we perform an extensive experimental study of mean estimation techniques in regimes where the data size may not meet this requirement due to the high-dimensional setting.
For data with inliers generated from a Gaussian with known covariance, we find experimentally that several robust mean estimation techniques can practically improve upon the sample mean, with the quantum entropy scoring approach of Dong et al. (NeurIPS 2019) performing consistently the best. However, this consistent improvement requires a couple of simple modifications to the outlier-pruning steps, both in the high-dimension low-data setting and when the inliers deviate significantly from Gaussianity. In fact, with these modifications, these methods typically achieve roughly the same error as the sample mean of the uncorrupted inlier data, even with very little data. In addition to controlled experiments on synthetic data, we also explore these methods on embeddings from large language models, deep pretrained image models, and non-contextual word embedding models, which do not necessarily follow a Gaussian distribution. Yet in these settings, the mean of a set of embedded objects is a desirable quantity to learn, and the data exhibits the high-dimension low-data regime studied in this paper. We show both the challenges of achieving this goal and that our updated robust mean estimation methods can provide significant improvement over the sample mean alone. We additionally publish a library of Python implementations of robust mean estimation algorithms, allowing practitioners and researchers to apply these techniques and to perform further experimentation.
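To make the setting concrete, below is a minimal, self-contained sketch of the corruption model (our own illustration; it is not the quantum entropy scoring method, nor the API of the linked library). It shows how an eps-fraction of adversarial points pulls the sample mean off target in high dimensions, using a coordinate-wise median as a simple robust baseline:

```python
import numpy as np

# Illustrative corruption model: inliers from a d-dimensional Gaussian,
# with an eps-fraction of points replaced by a far-away adversarial cluster.
rng = np.random.default_rng(0)
n, d, eps = 200, 500, 0.1   # high-dimension low-data regime: n < d

true_mean = np.zeros(d)
data = rng.standard_normal((n, d)) + true_mean

# Replace an eps-fraction of the points with outliers.
k = int(eps * n)
data[:k] = 5.0

# The naive sample mean is pulled toward the outliers.
sample_mean = data.mean(axis=0)

# A simple robust baseline: the coordinate-wise median resists the corruption.
cw_median = np.median(data, axis=0)

print("sample mean error :", np.linalg.norm(sample_mean - true_mean))
print("coord-median error:", np.linalg.norm(cw_median - true_mean))
```

The coordinate-wise median is known to be suboptimal in high dimensions (its error grows with the dimension for a fixed corruption fraction), which is what motivates the more sophisticated estimators studied in the paper and implemented in the linked library.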
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: This is a substantial update in response to the reviews. Changed text is shown in light purple. Here is a short summary of the changes; for a longer discussion, see the review responses:
1. We changed the title to highlight the paper's large empirical study.
2. We reworked the introduction and background sections to better highlight the contribution and streamline the discussion.
3. We have added more background on word embeddings as a preface to the study of mean estimation on data in that form.
4. We added new experiments, and a new section (Section 6), on robust mean estimation where the inlier data is non-Gaussian. For multivariate-t and mixture-of-Gaussian distributions, results differ noticeably from the Gaussian-inlier case, while in other cases they are about the same.
5. A number of minor or localized updates throughout the paper.
Video: https://www.youtube.com/watch?v=Tp_1rFTRMBI&t=46s
Code: https://github.com/cullena20/RobustMeanEstimation
Supplementary Material: zip
Assigned Action Editor: ~Matthew_J._Holland1
Submission Number: 3521