Data Glitches Discovery using Influence-based Model Explanations

Published: 23 Jun 2025 · Last Modified: 23 Jun 2025 · Greeks in AI 2025 (Oral) · License: CC BY 4.0
Keywords: Training Data Debugging, Data Debugging for ML, Mislabeled Samples, Anomalies, Influence Functions, Influence Signals
TL;DR: We build signals that exploit samples' influence on a model's decision boundary to detect, explain, and repair data glitches (mislabeled or anomalous samples). The signals are accurate and robust across different models.
Abstract: We address the problem of detecting data glitches in ML training sets, specifically mislabeled and anomalous samples. Detecting data glitches provides insight into the quality of the data sampling, and repairing them may improve the reliability and performance of the model. The proposed methodology exploits influence functions, which estimate how much the loss of the model (or of a given sample) changes when a sample is removed from the training set. We introduce three novel signals, based on sample influences, for detecting, characterizing, and repairing data glitches in a training set. Influence-based signals form an explainable-by-design data glitch detection framework, producing intuitively explainable signals grounded in the actual predictive model being built. In contrast, specialized algorithms that are agnostic to the target ML model (e.g., anomaly detectors) replicate the work of fitting the data distribution and may detect glitches that are inconsistent with the decision boundary of the predictive model. Computational experiments on tabular and image data modalities demonstrate that the proposed signals outperform all existing influence-based signals, in some cases by up to a factor of 6, and generalize across different datasets and ML models. In addition, they often outperform specialized glitch detectors (e.g., mislabel and anomaly detectors) and provide accurate label repairs for mislabeled samples. This work has been accepted for publication in the research track of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD): https://dl.acm.org/doi/10.1145/3690624.3709285
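To make the influence-function mechanism the abstract refers to concrete, here is a minimal, self-contained sketch of one common influence-based quantity, self-influence (g_i^T H^{-1} g_i), used to flag a planted mislabeled sample. This is an illustrative assumption-laden toy (synthetic data, a plain logistic regression, a deliberately flipped label), not the paper's actual signals or datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs; flip one label to plant a "glitch".
n = 100
X = np.vstack([rng.normal(-2.0, 1.0, (n, 2)), rng.normal(2.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])
y[0] = 1.0  # sample 0 is now mislabeled

Xb = np.hstack([X, np.ones((2 * n, 1))])  # append a bias column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Fit L2-regularized logistic regression by plain gradient descent.
lam, w = 1e-2, np.zeros(3)
for _ in range(5000):
    p = sigmoid(Xb @ w)
    w -= 0.1 * (Xb.T @ (p - y) / len(y) + lam * w)

# Per-sample loss gradients and the regularized Hessian at the fit.
p = sigmoid(Xb @ w)
grads = (p - y)[:, None] * Xb                        # shape (2n, 3)
H = (Xb.T * (p * (1 - p))) @ Xb / len(y) + lam * np.eye(3)
H_inv = np.linalg.inv(H)

# Self-influence g_i^T H^{-1} g_i: how strongly removing sample i
# would change its own loss. Mislabeled points tend to score high.
self_infl = np.einsum("ij,jk,ik->i", grads, H_inv, grads)
suspects = np.argsort(self_infl)[::-1][:5]
print("top suspects:", suspects)  # the planted mislabel should rank among these
```

Ranking samples by such influence scores is the general idea behind influence-based glitch detection; the paper's contribution is a set of refined signals built on this machinery that also characterize the glitch type and suggest label repairs.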
Submission Number: 48