Towards General Robustness to Bad Training Data

Published: 28 Jan 2022, Last Modified: 13 Feb 2023, ICLR 2022 Submission
Keywords: General Robustness, Data Valuation, Data Utility Learning
Abstract: In this paper, we focus on the problem of identifying bad training data when the underlying cause is unknown in advance. Our key insight is that regardless of how bad data are generated, they tend to contribute little to training a model with good prediction performance or, more generally, to the data analyst's utility function. We formulate good/bad data selection as utility optimization. We propose a theoretical framework for evaluating the worst-case performance of data selection heuristics. Remarkably, our results show that the popular heuristic based on the Shapley value may choose the worst data subset in certain practical scenarios, which sheds light on the large performance variation it has exhibited in past empirical work. We then develop an algorithmic framework, DataSifter, that detects a variety of data issues, including previously unknown ones---a step towards general robustness to bad training data. DataSifter is guided by the theoretically optimal solution to data selection and is made practical by the data utility learning technique. Our evaluation shows that DataSifter matches, and most often significantly improves upon, state-of-the-art performance across a wide range of tasks, including backdoor, poisoning, and noisy/mislabeled data detection, data summarization, and data debiasing.
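To make the utility-optimization view concrete, here is a minimal, hypothetical sketch (not the paper's DataSifter implementation): data selection is cast as choosing the subset S of training points that maximizes a utility function U(S), such as validation accuracy of a model trained on S, and the search is approximated greedily. The `greedy_select` helper, the toy dataset, and the toy `utility` function are all illustrative assumptions.

```python
# Hypothetical sketch: good/bad data selection as utility optimization.
# U(S) scores a candidate training subset S; bad points (e.g., mislabeled)
# add little utility, so a utility-maximizing selection tends to drop them.

def greedy_select(points, utility, k):
    """Greedily grow a subset of size k, maximizing the utility gain per step."""
    selected = []
    remaining = list(points)
    for _ in range(k):
        # Pick the point whose inclusion yields the largest utility.
        best = max(remaining, key=lambda p: utility(selected + [p]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: (feature, label) pairs; the last point is mislabeled.
data = [(0.1, 0), (0.2, 0), (0.9, 1), (0.8, 1), (0.15, 1)]

def utility(subset):
    # Toy stand-in for U(S): count points consistent with a simple
    # reference rule (label 1 iff feature > 0.5).
    return sum(1 for x, y in subset if (x > 0.5) == (y == 1))

print(greedy_select(data, utility, k=4))  # selects the four clean points
```

In practice U(S) requires retraining a model per candidate subset, which is expensive; this is where a learned utility proxy (the data utility learning technique mentioned in the abstract) would stand in for direct evaluation.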
One-sentence Summary: We formulate good/bad data selection as utility optimization, propose a theoretical framework for analyzing data selection heuristics, and develop an algorithmic framework that detects a variety of data issues, including previously unknown ones.