Approximations to worst-case data dropping: unmasking failure modes

NeurIPS 2024 Workshop ATTRIB Submission1 Authors

Published: 30 Oct 2024, Last Modified: 14 Jan 2025, Venue: ATTRIB 2024, License: CC BY 4.0

Release Opt Out: No, I don't wish to opt out of paper release. My paper should be released.

Keywords: Influence Function, Sensitivity Analysis, Linear Regression, Masking, Robust Statistics

Abstract: A data analyst would worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Finding the worst-case data subset to drop poses a combinatorial optimization problem. To overcome this intractability, recent works propose two kinds of approximations: additive approximations, which treat the contribution of a collection of data points as the sum of their individual contributions, and greedy approximations, which iteratively select the point with the highest impact to drop and re-run the data analysis without that point [Broderick et al., 2020; Kuschnig et al., 2021]. We show that, even in a setting as simple as OLS linear regression, many of these approximations can break down in realistic data arrangements. Several of our examples reflect masking, where one data point hides the effect of another. We provide recommendations for users and suggest directions for future development.
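To make the two families of approximations concrete, the following is a minimal sketch for OLS, not the authors' implementation. It assumes the quantity of interest is a single regression coefficient and that the analyst wants to push it downward. The function names (`ols`, `first_order_influence`, `additive_drop`, `greedy_drop`) are hypothetical, and the use of the first-order influence score $-[(X^\top X)^{-1} x_i]_j \, r_i$ as the per-point contribution is an assumption made for illustration; the cited works may score points differently.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X should include any intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def first_order_influence(X, y, coef):
    """Approximate change in beta[coef] from dropping each point individually.

    Assumption for this sketch: the score is -[(X'X)^{-1} x_i]_coef * r_i,
    where r_i is the residual of point i under the current fit.
    """
    beta = ols(X, y)
    resid = y - X @ beta
    xtx_inv = np.linalg.inv(X.T @ X)
    # (X'X)^{-1} is symmetric, so X @ xtx_inv[:, coef] gives [(X'X)^{-1} x_i]_coef per row.
    return -(X @ xtx_inv[:, coef]) * resid

def additive_drop(X, y, coef, k):
    """Additive approximation: score every point once, then sum the k most negative scores."""
    infl = first_order_influence(X, y, coef)
    idx = np.argsort(infl)[:k]
    predicted = ols(X, y)[coef] + infl[idx].sum()
    return idx, predicted

def greedy_drop(X, y, coef, k):
    """Greedy approximation: drop the single most negative point, refit, and repeat k times."""
    keep = np.ones(len(y), dtype=bool)
    dropped = []
    for _ in range(k):
        infl = first_order_influence(X[keep], y[keep], coef)
        # Map the position within the kept subset back to the original row index.
        i = np.flatnonzero(keep)[np.argmin(infl)]
        keep[i] = False
        dropped.append(i)
    return np.array(dropped), ols(X[keep], y[keep])[coef]
```

The additive version scores every point against the full-data fit only, so it cannot notice a point whose influence becomes large only after another point is removed. The greedy version rescores after each refit, but it still commits to one point at a time. Comparing either output against an exact refit on the chosen subset is a simple check on how far the approximation has drifted.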

Submission Number: 1
