A Simple Scoring Function to Fool SHAP: Stealing from the One Above

Published: 27 Oct 2023, Last Modified: 22 Nov 2023, NeurIPS XAIA 2023
TL;DR: XAI methods such as SHAP can identify unfairness in black-box models by revealing the impact of protected attributes; we propose a simple adversarial scoring function that bypasses SHAP's detection.
Abstract: Explainable AI (XAI) methods such as SHAP can help discover unfairness in black-box models. If the XAI method reveals a significant impact from a "protected attribute" (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial models can subvert the detection of XAI methods. Previous approaches to constructing such an adversarial model require access to the underlying data distribution. We propose a simple rule that requires no access to the underlying data or data distribution. It can adapt any scoring function to fool XAI methods such as SHAP. Our work calls for more attention to scoring functions, in addition to classifiers, in XAI research and reveals the limitations of XAI methods for explaining the behavior of scoring functions.
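To make the detection scenario concrete, here is a minimal, self-contained sketch of how a Shapley-value explanation flags an unfair scoring function. The exact Shapley computation, the single-baseline treatment of "absent" features (a simplification of KernelSHAP's background averaging), and the toy `score` function with a penalty on a protected attribute are all illustrative assumptions, not the paper's construction; the paper's adversarial rule is designed precisely to defeat this kind of check.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values for f at point x. 'Absent' features are
    replaced by a single baseline value (a common simplification;
    KernelSHAP averages over a background dataset instead)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Shapley coalition weight: |S|! (n - |S| - 1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += w * (f(with_i) - f(without_i))
    return phi

# Hypothetical unfair scoring function: rewards feature 0 (e.g. income)
# and penalizes feature 1, a protected attribute.
score = lambda v: 2.0 * v[0] - 1.0 * v[1]

phi = exact_shapley(score, x=[3.0, 1.0], baseline=[0.0, 0.0])
print(phi)  # the protected attribute receives a nonzero attribution (-1.0)
```

A fair model would assign the protected attribute a Shapley value near zero; the nonzero attribution here is exactly the signal the proposed adversarial rule hides.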
Submission Track: Full Paper Track
Application Domain: Social Science
Survey Question 1: A scoring function takes data and outputs a score or a vector of scores. Such scores can then be converted into class labels for classification or into an ordering for ranking. In our work, we construct scoring functions that have hidden unfair behaviors and can bypass the detection of XAI methods, such as SHAP.
Survey Question 2: A scoring function that is not explained in terms of its input data is open to misuse or unfair use.
Survey Question 3: SHAP explanations are used.
Submission Number: 47