Data Debugging is NP-hard for Classifiers Trained with SGD

15 May 2024 (modified: 06 Nov 2024)Submitted to NeurIPS 2024EveryoneRevisionsBibTeXCC BY-NC-ND 4.0
Keywords: data debugging, machine learning interpretations, computational complexity
Abstract: Data debugging is to find a subset of the training data such that the model obtained by retraining on the subset has a better accuracy. A bunch of heuristic approaches are proposed, however, none of them are guaranteed to solve this problem effectively. This leaves an open issue whether there exists an efficient algorithm to find the subset such that the model obtained by retraining on it has a better accuracy. To answer this open question and provide theoretical basis for further study on developing better algorithms for data debugging, we investigate the computational complexity of the problem named \textsc{Debuggable}. Given a machine learning model $\mathcal{M}$ obtained by training on dataset $D$ and a test instance $(\mathbf{x}_\text{test},y_\text{test})$ where $\mathcal{M}(\mathbf{x}_\text{test})\neq y_\text{test}$, \textsc{Debuggable} is to determine whether there exists a subset $D^\prime$ of $D$ such that the model $\mathcal{M}^\prime$ obtained by retraining on $D^\prime$ satisfies $\mathcal{M}^\prime(\mathbf{x}_\text{test})=y_\text{test}$. To cover a wide range of commonly used models, we take SGD-trained linear classifier as the model and derive the following main results. (1) If the loss function and the dimension of the model are not fixed, \textsc{Debuggable} is NP-complete regardless of the training order in which all the training samples are processed during SGD. (2) For hinge-like loss functions, a comprehensive analysis on the computational complexity of \textsc{Debuggable} is provided; (3) If the loss function is a linear function, \textsc{Debuggable} can be solved in linear time, that is, data debugging can be solved easily in this case. These results not only highlight the limitations of current approaches but also offer new insights into data debugging.
Primary Area: Interpretability and explainability
Submission Number: 18381
Loading