Abstract: Public datasets, crucial for modern machine learning and statistical inference, often contain low-quality or contaminated data that harms model performance. This motivates the development of principled prefiltering procedures that facilitate accurate downstream learning. In this work, we formalize the problem of **L**earner-**A**gnostic **R**obust data **P**refiltering (LARP), which seeks prefiltering procedures that minimize a worst-case loss over a pre-specified set of learners. We instantiate this framework in two theoretical settings, providing a hardness result and upper bounds. Our theoretical results indicate that performing LARP on heterogeneous learner sets causes some performance loss compared to individual, learner-specific prefiltering; we term this gap the price of LARP. To assess whether LARP remains worthwhile, we (i) empirically measure the price of LARP across image and tabular tasks and (ii) introduce a game-theoretic cost model that trades off the price of LARP against the cost of learner-specific prefiltering. The model yields sufficient conditions under which LARP is provably beneficial.
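One plausible way to write down the worst-case objective and the gap described in the abstract is sketched below; the notation ($S$, $\mathcal{F}$, $\mathcal{A}$, $L$) is assumed for illustration and is not taken from the paper itself.

```latex
% Assumed notation: S is the raw dataset, \mathcal{F} a class of prefiltering
% procedures F : S \mapsto F(S) \subseteq S, \mathcal{A} the pre-specified set
% of learners, and L(A, F(S)) the downstream loss of learner A trained on F(S).
%
% Learner-agnostic prefiltering (LARP) picks one filter for the whole set:
\[
  F^\star \in \arg\min_{F \in \mathcal{F}} \; \max_{A \in \mathcal{A}} \; L\bigl(A, F(S)\bigr).
\]
% One natural notion of the "price of LARP" compares this shared filter to
% learner-specific prefiltering, where each A gets its own best filter:
\[
  \mathrm{price}(\mathcal{A}) \;=\;
  \max_{A \in \mathcal{A}} \Bigl[ L\bigl(A, F^\star(S)\bigr)
  \;-\; \min_{F \in \mathcal{F}} L\bigl(A, F(S)\bigr) \Bigr].
\]
```

Under this reading, the price is zero when the learner set is homogeneous enough that a single filter is simultaneously optimal for every learner, and grows as the learners' preferences over filtered datasets diverge.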
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ozan_Sener1
Submission Number: 6679