TL;DR: GInX-Eval is an evaluation procedure for in-distribution GNN explanations that overcomes the limitations of faithfulness metrics; it serves as a tool to validate ground-truth explanations and to reveal the true informative power of xAI methods.
Abstract: Diverse explainability methods for graph neural networks (GNN) have recently been developed to highlight the edges and nodes in the graph that contribute the most to the model predictions. However, it is not yet clear how to evaluate the *correctness* of those explanations, whether from a human or a model perspective. One unaddressed bottleneck in the current evaluation procedure is the problem of out-of-distribution explanations, whose distribution differs from that of the training data. This important issue affects existing evaluation metrics such as the popular faithfulness or fidelity score. In this paper, we show the limitations of faithfulness metrics. We propose **GInX-Eval** (**G**raph **In**-distribution e**X**planation **Eval**uation), an evaluation procedure for graph explanations that overcomes the pitfalls of faithfulness and offers new insights into explainability methods. Using a fine-tuning strategy, the GInX score measures how informative removed edges are for the model, and the HomophilicRank score evaluates whether explanatory edges are correctly ordered by their importance and whether the explainer accounts for redundant information. GInX-Eval verifies whether ground-truth explanations are instructive to the GNN model. In addition, it shows that many popular methods, including gradient-based methods, produce explanations that are no better than a random designation of edges as the important subgraph, challenging the findings of current works in the area. Results with GInX-Eval are consistent across multiple datasets and align with human evaluation.
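To make the evaluation loop described in the abstract concrete, here is a minimal, hedged sketch of how a GInX-style score could be computed: remove the top-t fraction of edges ranked by an explainer, fine-tune the model on the edge-removed graphs so that evaluation stays in-distribution, and record the resulting accuracy drop. All helper names (`remove_top_edges`, `fine_tune`, `evaluate_accuracy`, the graph representation, and the thresholds) are illustrative placeholders, not the authors' code or exact definition.

```python
# Hypothetical sketch of a GInX-style evaluation loop (not the authors' implementation).
from typing import Callable, Dict, List, Sequence, Tuple

Graph = Dict[str, object]          # e.g. {"edges": [(u, v), ...], "label": int}
EdgeScores = Sequence[float]       # one importance score per edge, from an explainer


def remove_top_edges(graph: Graph, scores: EdgeScores, t: float) -> Graph:
    """Drop the fraction t of edges with the highest explanation scores."""
    edges = list(graph["edges"])
    k = int(round(t * len(edges)))
    keep_idx = sorted(range(len(edges)), key=lambda i: scores[i])[: len(edges) - k]
    return {**graph, "edges": [edges[i] for i in sorted(keep_idx)]}


def ginx_curve(
    model,                                   # trained GNN (placeholder object)
    graphs: List[Graph],
    explain: Callable[[object, Graph], EdgeScores],
    fine_tune: Callable[[object, List[Graph]], object],
    evaluate_accuracy: Callable[[object, List[Graph]], float],
    thresholds: Sequence[float] = (0.1, 0.2, 0.3, 0.5),
) -> List[Tuple[float, float]]:
    """For each removal ratio t, fine-tune on the reduced graphs (so they remain
    in-distribution for the evaluated model) and record the accuracy drop.
    A large drop means the removed edges were informative to the model."""
    base_acc = evaluate_accuracy(model, graphs)
    curve = []
    for t in thresholds:
        reduced = [remove_top_edges(g, explain(model, g), t) for g in graphs]
        tuned = fine_tune(model, reduced)    # fine-tuning avoids OOD inputs
        acc_t = evaluate_accuracy(tuned, reduced)
        curve.append((t, base_acc - acc_t))  # accuracy drop at removal ratio t
    return curve
```

The fine-tuning step is the key design choice relative to faithfulness/fidelity: instead of feeding the original model graphs it has never seen (edge-masked, out-of-distribution inputs), the model is adapted to the reduced graphs before measuring accuracy, so the score reflects the informativeness of the removed edges rather than a distribution shift.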
Submission Track: Full Paper Track
Application Domain: None of the above / Not applicable
Clarify Domain: All domains that work with graphs
Survey Question 1: We show the limitations of faithfulness, one of the most common XAI evaluation metrics, namely that it evaluates out-of-distribution explanations. In addition, we observe that (1) it is inconsistent with the accuracy metric, (2) it leads to divergent conclusions across datasets, and (3) it leads to divergent conclusions across edge removal strategies. We propose GInX-Eval, an evaluation procedure of in-distribution explanations that brings new perspectives on GNN explainability methods. GInX-Eval helps to select explainability methods that capture graph entities that are informative for the model; this includes filtering out poor methods and choosing methods that can correctly rank edges by their importance.
Survey Question 2: We propose a new evaluation procedure for explainability methods. The goal is to show the xAI community that the metrics currently used to assess the performance of explainability methods are not appropriate. Explainability is at the core of our work.
Survey Question 3: To showcase GInX-Eval, we evaluate 3 baseline edge importance estimators, 8 non-generative explainability methods (Occlusion, GradCAM, Saliency, Integrated Gradients, GNNExplainer in its edge-only and edge-plus-node-feature variants, PGMExplainer, and SubgraphX), and 5 generative methods (GraphCFE, GSAT, D4Explainer, PGExplainer, RCExplainer).
Submission Number: 8