Assessing Explanation Fragility of SHAP using Counterfactuals

Published: 05 Nov 2025, Last Modified: 05 Nov 2025 | NLDL 2026 Oral | CC BY 4.0
Keywords: Explainability, Counterfactuals, SHAP, DiCE, Fairness
TL;DR: We assess the fragility of SHAP explanations using counterfactuals on four fairness benchmark datasets and three ML models.
Abstract: Post-hoc explanations such as SHAP are increasingly used to justify the predictions of machine learning models, especially deep learning models. However, these explanations can be fragile: small, realistic input changes can cause different features to be deemed important. We present a multi-seed, distance-controlled stability assessment of SHAP explanations. For each instance, we use DiCE to generate plausible counterfactuals, pool them across random seeds, deduplicate, and retain the $K$ nearest ones. Using a shared independent masker and the model’s logit (raw margin), we measure per-feature attribution shifts and summarize instance-level instability. On four tabular fairness benchmark datasets, we apply our protocol to a multilayer perceptron, logistic regression, and decision trees, reporting within-model group-wise explanation stability and examining which features most often drive the observed shifts. We report coverage, effective-$K$, distance-to-boundary, and outlier diagnostics to contextualize our findings. The protocol is model-agnostic yet practical for deep networks (batched inference, shared background), turning explanation variability into an actionable fairness assessment without altering trained models.
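The measurement step described in the abstract (shared independent masker, logit output, K nearest deduplicated counterfactuals, per-feature attribution shifts) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `logit_fn`, `attribution_shift`, the `background` data, the L2 nearest-neighbor selection, and the mean-absolute-shift summary are assumptions, and counterfactual generation with DiCE (pooled across seeds) is assumed to have happened beforehand, yielding `cf_df`.

```python
# Hypothetical sketch of the SHAP attribution-shift measurement; names and
# summary statistic are illustrative assumptions, not the paper's exact code.
import numpy as np
import pandas as pd
import shap

def logit_fn(model, X):
    """Raw margin (logit) of the positive class; assumes a sklearn-style classifier."""
    p = model.predict_proba(X)[:, 1]
    eps = 1e-12
    return np.log(np.clip(p, eps, 1 - eps) / np.clip(1 - p, eps, 1 - eps))

def attribution_shift(model, x_row, cf_df, background, K=5):
    """Per-feature SHAP shift between an instance and its K nearest counterfactuals.

    x_row:      pandas Series with the instance's features
    cf_df:      DataFrame of counterfactuals pooled across seeds (same columns as x_row)
    background: DataFrame used as the shared background for the independent masker
    """
    # Shared independent masker and explainer over the model's logit (raw margin).
    masker = shap.maskers.Independent(background, max_samples=100)
    explainer = shap.Explainer(lambda X: logit_fn(model, X), masker)

    # Deduplicate pooled counterfactuals and keep the K nearest (L2 in feature space).
    cf_df = cf_df.drop_duplicates()
    dists = np.linalg.norm(cf_df.values - x_row.values, axis=1)
    nearest = cf_df.iloc[np.argsort(dists)[:K]]

    # Attributions for the original instance and its counterfactuals (batched inference).
    phi_x = explainer(x_row.to_frame().T).values[0]
    phi_cf = explainer(nearest).values

    # Instance-level instability: mean absolute per-feature attribution shift.
    shifts = np.abs(phi_cf - phi_x)            # shape (K_eff, n_features)
    return pd.Series(shifts.mean(axis=0), index=cf_df.columns)
```

Because the masker and background are shared across the instance and its counterfactuals, any attribution shift reflects the input perturbation rather than a change in the SHAP baseline; using the logit instead of the probability keeps attributions on an additive scale.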
Serve As Reviewer: ~Sebastian_Mair1
Submission Number: 50