Title: OOD-Resistant Adversarial Robustness (OAR): A Novel Metric for Robust Evaluation of GNN Explanations

Abstract: Reliable evaluation of post-hoc explanations for Graph Neural Networks (GNNs) is critical for their trustworthy deployment, yet conventional metrics often struggle with out-of-distribution (OOD) issues. This work directly confronts this challenge by introducing OOD-resistant Adversarial Robustness (OAR), a novel evaluation metric. Inspired by adversarial robustness, OAR assesses the quality of an explanation subgraph by measuring its robustness under attack, crucially integrating an OOD reweighting block to ensure the evaluation remains within the original data distribution. For large-scale applications, we propose a Simplified OAR (SimOAR), which significantly enhances computational efficiency with minimal performance compromise. Extensive empirical studies across various explanation methods, datasets, and GNN backbones demonstrate the superior effectiveness and consistency of OAR and SimOAR compared to existing removal- and generation-based metrics. Code is available at https://github.com/MangoKiller/SimOAR_OAR. † Liu Wei is equal contribution to this paper.

Section: Introduction
Post-hoc explainability has emerged as a crucial area for interpreting Graph Neural Networks (GNNs) [1,2,3,4], aiming to identify salient subgraphs that justify model predictions and enhance trust, fairness, and understanding [5,6,7]. However, the reliable evaluation of these explanations remains a significant challenge. Traditional approaches, such as human supervision [8,9] and agreement with ground-truth explanations [10,11], suffer from subjectivity, labor-intensiveness, and limited availability of ground truth, restricting their practical applicability.

Figure 1: Pipelines and flaws of different evaluation methods. In the "Input" graph, -NH2 is considered the ground-truth explanation for its mutagenicity. Best viewed in color.
A prevailing quantitative evaluation paradigm is Feature Removal [12,13], which assesses an explanation's predictive power by removing "unimportant" features and observing the GNN's output on the remaining subgraph. Metrics like Accuracy [8] and Fidelity [9] stem from this idea. Despite their widespread use, removal-based metrics are severely hampered by the notorious out-of-distribution (OOD) issue [14,13]. When features are removed, the resulting subgraphs often deviate significantly from the original data distribution [15,16], forcing the GNN to process off-manifold inputs and potentially yield erroneous or unfaithful predictions [17,18]. For instance, as illustrated in Figure 1 (a), a GNN might correctly classify a full molecular graph as "mutagenic" due to a -NH2 group. Yet, when presented with only a non-mutagenic C-Cl subgraph, it might still incorrectly predict "mutagenic," undermining the faithfulness of the explanation evaluation.

In response to the OOD issue, Generation-based metrics [18,17] have been proposed. These methods employ generative models [19,20] to "infill" the subgraph, conditioning on it to generate a new full graph that is theoretically closer to the original data distribution. As shown in Figure 1 (b), the evaluation then compares predictions on this new graph with those on the original. While conceptually appealing, generative models often inherit and amplify data biases, injecting them into the infilling process. For example, if molecules in the Mutagenicity dataset with non-mutagenic chloride (-Cl) frequently co-occur with amino (-NH2) groups, a generative model might erroneously infill -NH2 when given a subgraph involving -Cl. This bias not only distorts the generated graphs but also leads to inconsistencies with the GNN's behavior: the generative model might assign high "mutagenic" scores to -Cl due to its co-occurrence with -NH2, whereas the GNN itself finds no mutagenic cues in -Cl. Thus, generation-based metrics, while addressing OOD to some extent, can be inconsistent with GNN behavior and lack precise control over the infilling process.

These limitations of removal- and generation-based metrics (summarized in Figure 1) lead us to a critical question: "Can we devise an evaluation metric that simultaneously respects both the data distribution and the GNN's intrinsic behavior?" To address this, we introduce OAR (OOD-resistant Adversarial Robustness), a novel and robust evaluation framework inspired by adversarial robustness [21,22]. As depicted in Figure 1 (c), OAR comprises two key components: constrained attack and OOD reweighting, which are designed to account for GNN behavior and data distribution, respectively. Specifically:
•   **Constrained Attack:** Drawing from the principle that perturbations on label-irrelevant features should minimally affect GNN predictions, while those on label-relevant features should be impactful [22,21], our attack model applies perturbations *only* to the complementary part of the explanation subgraph. This mechanism inherently controls the "infilling" process, ensuring the explanatory subgraph's influence is isolated and evaluated against targeted, meaningful changes.
•   **OOD Reweighting:** After generating a set of perturbed graphs, this component estimates an "OOD score" for each, quantifying its deviation from the original data distribution. These OOD scores are then used to reweight the GNN's predictions on the perturbed graphs. By summing these weighted predictions, OAR quantitatively assesses the importance of the target subgraph, effectively marginalizing OOD instances.

We conduct extensive empirical studies, validating OAR's effectiveness across various state-of-the-art explanation methods, diverse datasets, and different GNN backbones. OAR consistently demonstrates superior alignment with metrics like Precision, Recall, and human supervision, outperforming existing removal- and generation-based methods. Furthermore, to facilitate scalability for large datasets, we introduce a Simplified version of OAR (SimOAR), which achieves significant computational efficiency improvements with a minimal trade-off in performance. Our main contributions are summarized as follows:
•   We propose OAR, a novel evaluation metric for GNN explainability, which effectively resolves the limitations of current removal- and generation-based approaches by explicitly considering both data distribution and GNN behavior (Section 2.2).
•   We introduce SimOAR, a simplified yet highly efficient variant of OAR, designed for large-scale evaluation tasks. SimOAR significantly reduces execution time while maintaining strong performance (Section 2.3).
•   Comprehensive experimental results demonstrate that OAR and SimOAR consistently outperform contemporary evaluation metrics by a substantial margin, further highlighting SimOAR's computational efficiency (Section 3).

Section: Methodology
In this section, we propose an evaluation method for the explainability of GNNs from the perspective of adversarial robustness. We start with the notation of GNNs and their explainability in Section 2.1. After that, we detail our evaluation metric, OAR, via three progressive steps (Section 2.2). In Section 2.3, we provide a simplified version of OAR, called SimOAR, for applications demanding more efficient execution.

Section: Problem Formulation
Graph neural networks (GNNs). GNNs have achieved remarkable success due to their powerful representation ability. Without loss of generality, we focus on the graph classification task in this work: a well-trained GNN model $f$ takes a graph $G$ as input and outputs its probabilities $y$ over classes $\{1, \ldots, C\}$, i.e., $y = f(G) \in \mathbb{R}^{C}$. Typically, $G$ is an undirected graph with node set $\mathcal{V}$ and edge set $\mathcal{E}$. We represent the feature of node $v_i \in \mathcal{V}$ as a $d$-dimensional vector and collect the features of all nodes into $X \in \mathbb{R}^{|\mathcal{V}| \times d}$. We then define an adjacency matrix $A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ to describe the graph topology, where $A_{uv} = 1$ if the edge connecting nodes $u$ and $v$ exists, i.e., $(u, v) \in \mathcal{E}$, and $A_{uv} = 0$ otherwise. Based on these, $G$ can alternatively be represented as $G = (A, X)$.
Explainability for GNNs. Given the GNN model $f$, explanation techniques generally study the underlying relationships between its outputs $y$ and inputs $G$. They focus on explainability w.r.t. input features, aiming to answer "Which parts of the input graph contribute most to the model prediction?". To this end, explainers typically assign an importance score to each input feature (i.e., node $v_i$ or edge $(v_i, v_j)$) to trace its contribution. They then select the salient part (e.g., a subset of nodes or edges with top contributions) as the explanatory subgraph $G_s$ and delete the complementary part $\bar{G}_s = G \setminus G_s$. We denote the explanation method by $h$ and write the above process as $G_s = h(G, f)$.
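The notation above can be made concrete with a small sketch. The helper names (`adjacency_from_edges`, `split_by_importance`), the top-k selection rule, and the importance scores are illustrative assumptions and do not correspond to any particular explainer.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def adjacency_from_edges(num_nodes: int, edges: List[Edge]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix A for an undirected graph."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1
    return A


def split_by_importance(edge_scores: Dict[Edge, float], k: int) -> Tuple[Set[Edge], Set[Edge]]:
    """Return (explanatory edges, complementary edges) using the top-k scores."""
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)
    return set(ranked[:k]), set(ranked[k:])


# Hypothetical 4-node path graph with made-up importance scores.
scores = {(0, 1): 0.9, (1, 2): 0.7, (2, 3): 0.1}
A = adjacency_from_edges(4, list(scores))
explanation, complement = split_by_importance(scores, k=2)
```

Here `explanation` plays the role of the explanatory subgraph and `complement` the part that removal-based metrics would delete.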

Section: OOD-Resistant Adversarial Robustness
Looking back at removal- and generation-based evaluations, we emphasize that both classes come with inherent limitations. Specifically, removal-based metrics pay little heed to the data distribution and thus force GNNs to handle off-manifold instances, while generation-based metrics are inconsistent with GNN behavior and lose control of the infilled part. In this section, we argue that it is possible to get the best of both worlds, and avoid the pitfalls of each, by taking both GNN behavior and data distribution into account.
To meet these challenges, we develop our evaluation metric, OOD-resistant adversarial robustness (OAR), in three progressive steps: first, we formulate the adversarial robustness tailored to GNN explanations, which naturally conforms to the GNN behavior; second, we introduce a tractable and easy-to-implement objective for this adversarial robustness; third, we introduce an OOD reweighting block that confines the overall evaluation process to the original data distribution.
STEP 1: Formulation of Adversarial Robustness. We first introduce the notion of adversarial robustness in machine learning [21,22,23] that motivates our method. Concretely, given a machine learning model, an input $x$, and a subset of the input $x_s \subseteq x$, the adversarial robustness of $x_s$ denotes the minimum perturbation leading to a wrong prediction, on the condition that the perturbation is imposed only on $x_s$ [22]. Inspired by this idea, we define the adversarial robustness of a GNN explanation $G_s$ as the minimum adversarial perturbation $\delta$ on the structure of the complementary subgraph $\bar{G}_s$. More formally:
Definition 1. Given a GNN model $f$, an input graph $G = (A, X)$ with prediction $y$ and explanation $G_s$, suppose that $G' = (A', X')$ is the graph generated by adding and deleting edges in $G$. The adversarial robustness $\delta$ of explanation $G_s$ is defined as:
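$$\delta_{G_s} = \min_{A'} \sum_{u \in \mathcal{V}} \sum_{v \in \mathcal{V} \setminus u} |A_{uv} - A'_{uv}| \quad \text{s.t.} \quad \arg\max_i f(G')_i \neq \arg\max_i y_i, \quad \sum_{u \in \mathcal{V}_s} \sum_{v \in \mathcal{V}_s \setminus u} |A_{uv} - A'_{uv}| = 0, \tag{1}$$
where $\mathcal{V}$ and $\mathcal{V}_s$ are the node sets of $G$ and $G_s$, respectively.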
Definition 1 identifies the quality of explanation $G_s$ with the difficulty of reversing the prediction by perturbing only features that do not belong to $G_s$. That is, when $G_s$ is fixed, the more difficult it is to fool the model by perturbing its complement, the more important $G_s$ is. The key intuition is: if an explanation comprises most of the label-relevant features, it is conceivably hard to change the prediction by manipulating the remaining, label-irrelevant features. Thus, according to Definition 1, a good explanation yields high adversarial robustness $\delta$ and vice versa, which naturally conforms to the GNN behavior.
It seems that adversarial robustness is a feasible metric for evaluating GNN explanations. However, two questions still stand in the way of its adoption: 1) Is its objective (i.e., Equation (1)) tractable and easy to implement? 2) Does it respect the data distribution?
STEP 2: Finding a Tractable Objective. To answer the first question, we argue that Equation (1) is hard to realize and sometimes even intractable, for two possible reasons:
• The primary reason is that adversarial attacks may fail to find any adversarial example, since a solution satisfying both conditions in Equation (1) simultaneously may not exist. In other words, if the explanation $G_s$ is precise enough, it is almost impossible to reverse the prediction by manipulating features in the complementary part $\bar{G}_s$, which are mainly label-irrelevant.
• It is notoriously hard to search for the minimum adversarial perturbation $\delta$ in most cases. Current attack methods [24,25] typically settle for a sub-optimal solution. Thus, leveraging these methods could introduce additional bias and threaten the fairness of the evaluation.
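Both obstacles can be seen in a brute-force reading of Definition 1 on a toy graph. The sketch below is an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the GNN, edges are stored as pairs `(u, v)` with `u < v`, and the function counts undirected edge flips (the double sum in Equation (1) counts each such flip twice). The search is exponential in the number of flips and returns `None` when no adversarial example exists within the budget.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Optional, Set, Tuple

Edge = Tuple[int, int]


def min_flips_to_change_prediction(
    num_nodes: int,
    edges: Set[Edge],
    explanation_nodes: Set[int],
    predict: Callable[[FrozenSet[Edge]], int],
    max_flips: int,
) -> Optional[int]:
    """Smallest number of edge flips outside the explanation that changes the label."""
    original = frozenset(edges)
    label = predict(original)
    # Pairs with both endpoints inside the explanation stay frozen
    # (second constraint of Equation (1)).
    candidates = [
        (u, v)
        for u, v in combinations(range(num_nodes), 2)
        if not (u in explanation_nodes and v in explanation_nodes)
    ]
    for k in range(1, max_flips + 1):
        for flips in combinations(candidates, k):
            # Symmetric difference adds missing edges and deletes existing ones.
            perturbed = original.symmetric_difference(flips)
            if predict(perturbed) != label:
                return k
    return None  # no adversarial example within the budget


def toy_predict(edge_set: FrozenSet[Edge]) -> int:
    """Hypothetical classifier whose label depends only on edge (0, 1)."""
    return 1 if (0, 1) in edge_set else 0


edges = {(0, 1), (1, 2), (2, 3)}
delta_good = min_flips_to_change_prediction(4, edges, {0, 1}, toy_predict, max_flips=2)
delta_bad = min_flips_to_change_prediction(4, edges, {2, 3}, toy_predict, max_flips=2)
```

With the label-relevant edge anchored, the search finds nothing (`delta_good` is `None`), while anchoring an irrelevant pair of nodes lets a single flip change the label.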
To address these issues and make the evaluation objective tractable and easy to implement, we first state the inference derived from Definition 1:
Proposition 1. When a high-quality explanation $G_s$ is anchored (fixed), perturbations restricted to the complementary part $\bar{G}_s$ have a weak influence on the model prediction.
While Definition 1 evaluates explanations via the adversarial robustness $\delta$, Proposition 1 points to a more straightforward, tractable evaluation objective from the dual perspective. To be more specific, Definition 1 quantifies the perturbation on $\bar{G}_s$ that causes a change of prediction; conversely, Proposition 1 quantifies the change of prediction caused by perturbation on $\bar{G}_s$. More formally:
Definition 2. Given a GNN model $f$, an input graph $G = (A, X)$ with prediction $y$ and explanation $G_s$, suppose that $G' = (A', X')$ is the graph generated by adding and deleting edges in $G$. The approximate adversarial robustness $\delta^*$ of $G_s$ is defined as:
$$\delta^*_{G_s} = \mathbb{E}_{G'} \left( f(G')_c - y_c \right) \quad \text{s.t.} \quad c = \arg\max_i y_i, \quad \sum_{u \in \mathcal{V}_s} \sum_{v \in \mathcal{V}_s \setminus u} |A_{uv} - A'_{uv}| = 0, \tag{2}$$
where $\mathcal{V}_s$ refers to the node set of $G_s$; $c$ denotes the class predicted for $G$ by model $f$; and $f(G')_c$ is the probability that $f(G')$ assigns to class $c$.
Building on Definition 2, the evaluation method for quantifying the adversarial robustness of GNN explanations becomes more explicit and computationally convenient. As shown in Figure 2, for the to-be-evaluated subgraph $G_s$ (i.e., C-Cl in the red dotted box), we anchor it and randomly perturb the remaining part $\bar{G}_s$ to get graphs $G'$ (i.e., molecules in the green dotted box). Then we compare the expectation of the predictions on $G'$ with the prediction on the original graph $G$. If they are close, most features in $\bar{G}_s$ must be label-irrelevant. Hence, we can assign a high quality to the explanation $G_s$.
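A Monte Carlo reading of Definition 2 might look like the sketch below. It is a minimal illustration, not the authors' implementation: `toy_prob_c` is a hypothetical stand-in for $f(\cdot)_c$, the number of flips per sample (`num_flips`) and the uniform sampling of node pairs are assumptions, and edges are stored as pairs `(u, v)` with `u < v`.

```python
import random
from itertools import combinations
from typing import Callable, FrozenSet, Set, Tuple

Edge = Tuple[int, int]


def approx_adversarial_robustness(
    num_nodes: int,
    edges: Set[Edge],
    explanation_nodes: Set[int],
    prob_c: Callable[[FrozenSet[Edge]], float],
    num_samples: int,
    num_flips: int,
    rng: random.Random,
) -> float:
    """Estimate E_{G'}[f(G')_c - y_c] with the explanation's internal edges frozen."""
    original = frozenset(edges)
    y_c = prob_c(original)
    # Only pairs that are not fully inside the explanation may be flipped.
    candidates = [
        (u, v)
        for u, v in combinations(range(num_nodes), 2)
        if not (u in explanation_nodes and v in explanation_nodes)
    ]
    total = 0.0
    for _ in range(num_samples):
        flips = rng.sample(candidates, num_flips)
        perturbed = original.symmetric_difference(flips)
        total += prob_c(perturbed) - y_c
    return total / num_samples


def toy_prob_c(edge_set: FrozenSet[Edge]) -> float:
    """Hypothetical class-c probability that depends only on edge (0, 1)."""
    return 0.9 if (0, 1) in edge_set else 0.1


edges = {(0, 1), (1, 2), (2, 3)}
rng = random.Random(0)
delta_star_good = approx_adversarial_robustness(4, edges, {0, 1}, toy_prob_c, 200, 2, rng)
delta_star_bad = approx_adversarial_robustness(4, edges, {2, 3}, toy_prob_c, 200, 2, rng)
```

Anchoring the label-relevant edge leaves the prediction untouched, so the estimate stays at zero; anchoring an irrelevant part lets random perturbations hit the relevant edge and drives the estimate below zero.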
So far, there is only one question left: how to ensure that the aforementioned evaluation process respects the data distribution?
STEP 3: OOD Reweighting Block Tailored for GNNs. Before elaborating on our OOD block, we first recall that in most scenarios a tiny perturbation does not induce a large distribution shift in the input space, thanks to the approximate continuity of input features (e.g., image pixels in computer vision). Unfortunately, the structural features of GNN inputs (an adjacency matrix consisting of 0s and 1s) are discrete, so even a single perturbation (e.g., adding or deleting an edge) can induce a large distribution shift and violate underlying properties such as the node degree distribution [26], the graph size distribution [27], and domain-specific constraints [28].
Thus, it is pivotal to construct an OOD reweighting block that assesses whether the generated graph $G'$ deviates from the data manifold. This block is expected to assign to each $G'$ an "OOD score" that reflects the degree of distribution shift between $G'$ and the original graph $G$. However, it is non-trivial to quantify the degree of OOD [29]. Inspired by the great success of graph anomaly detection [30,31,32,33,34], we treat an OOD instance as an "anomaly", since it is isolated from the original data distribution, and employ a module commonly used in anomaly detection, the variational graph auto-encoder (VGAE) [19] with an encoder and a decoder, to instantiate our OOD reweighting block. The great success of diffusion generative models has also been recognized recently [35,36]; we defer their use in this block to future investigation.
Concretely, as shown in Figure 2, the preparation for evaluations is training the VGAE model on the dataset D where input graph G is sampled. After that, we can leverage the reconstruction loss to approximate the degree of OOD for any generated instance G ′ . To be more specific:
• Given $G' = (A', X')$, the encoder first learns a latent matrix $Z$ from $A'$ and $X'$, where row $z_i$ corresponds to node $v'_i$ in $G'$. Each $z_i$ is assumed to follow an independent normal distribution with expectation $\mu_i$ and variance $\sigma_i^2$. Formally:
$$q(Z \mid A', X') = \prod_{i=1}^{|\mathcal{V}'|} q(z_i \mid A', X') = \prod_{i=1}^{|\mathcal{V}'|} \mathcal{N}\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\right), \tag{3}$$
where $\mu$ and $\sigma$ are parameterized by two two-layer GCNs [37] called $\mathrm{GCN}_\mu$ and $\mathrm{GCN}_\sigma$.
• Then, the decoder recovers the adjacency matrix $A'$ based on $Z$:
$$p(A' \mid Z) = \prod_{i=1}^{|\mathcal{V}'|} \prod_{j=1}^{|\mathcal{V}'|} p(A'_{ij} \mid z_i, z_j), \quad \text{with} \quad p(A'_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^\top z_j), \tag{4}$$
where $\sigma(\cdot)$ is the logistic sigmoid function.
• The OOD score of $G'$ is given by the normalized reciprocal of the reconstruction loss $\mathcal{L}_{\mathrm{recon}}(G')$,
$$\mathcal{L}_{\mathrm{recon}}(G') = -\log p(A' \mid Z), \quad \text{with} \quad Z = \mu = \mathrm{GCN}_\mu(A', X'). \tag{5}$$
Since the VGAE is trained on the dataset D, a graph G′ that strays far from the data distribution of D gets a high reconstruction loss L recon . Thus, being the reciprocal of L recon , the OOD score of such a G′ is low. Conversely, if G′ is in distribution, it gains a high OOD score because it is easy to reconstruct. Based on this, our OOD block can mitigate the impact of OOD instances. Specifically, the OOD score is used as the weight of each prediction when calculating the expectation of the generated graphs' predictions. This allows instances with low OOD scores to be marginalized, as shown in the gray dotted box of Figure 2.
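The scoring step can be sketched as follows, assuming the latent matrix Z is already available (the trained encoder is not modeled here and the values of `Z` are made up). The loss follows the inner-product decoder over all index pairs, as in Equation (4); the exact normalization of the reciprocal is not specified in the text, so dividing by the sum over the generated graphs is an assumption, as is the clipping constant.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reconstruction_loss(A: List[List[int]], Z: Matrix) -> float:
    """Negative log-likelihood of A under an inner-product decoder on Z."""
    n = len(A)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            p = sigmoid(sum(a * b for a, b in zip(Z[i], Z[j])))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # clip to keep the log finite
            loss -= math.log(p) if A[i][j] == 1 else math.log(1.0 - p)
    return loss


def ood_scores(losses: List[float]) -> List[float]:
    """Normalized reciprocal of the losses (normalization by the sum is assumed)."""
    inverse = [1.0 / loss for loss in losses]
    total = sum(inverse)
    return [w / total for w in inverse]


# Hypothetical latent means: nodes 0 and 1 are similar, node 2 is dissimilar.
Z = [[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
A_plausible = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]    # links the similar nodes
A_implausible = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]  # links dissimilar nodes only
losses = [reconstruction_loss(A_plausible, Z), reconstruction_loss(A_implausible, Z)]
scores = ood_scores(losses)
```

The graph that agrees with the latent geometry has the lower loss and therefore the larger weight.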
Overall evaluation process. Now that the last piece of the OAR puzzle, the OOD reweighting block, has been instantiated, let us revisit Figure 2 and summarize the overall process of OAR:
1. Before we evaluate the explanatory subgraph $G_s$, the OOD reweighting block (i.e., VGAE) is trained on the dataset D from which the input graph $G$ is sampled.
2. Then, we fix $G_s$ and randomly perturb the structure of the complementary part $\bar{G}_s$ to get $G'$.
3. Each $G'$ is fed into the GNN $f$ and the VGAE simultaneously to obtain its prediction and OOD score. Both the GNN's behavior and the data distribution are taken into consideration in this step.
4. Finally, according to the predictions and their weights (i.e., OOD scores), we calculate the weighted average of the generated graphs' predictions. The closer this average is to the original prediction on $G$, the higher the quality of the explanation $G_s$.
The pseudocode and the tricks to expedite computations are detailed in Appendix A.
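As a complement to the pseudocode in Appendix A, the final aggregation step can be sketched in a few lines. This is a minimal illustration with made-up numbers: the predictions and reconstruction losses are assumed to be precomputed, and normalizing the reciprocal losses by their sum is an assumption.

```python
from typing import List


def oar_score(probs_c: List[float], recon_losses: List[float]) -> float:
    """OOD-weighted average of f(G')_c over the generated graphs."""
    inverse = [1.0 / loss for loss in recon_losses]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    return sum(p * w for p, w in zip(probs_c, weights))


# Hypothetical values: the third generated graph is far off-distribution
# (large reconstruction loss) and also has a very different prediction.
probs_c = [0.9, 0.8, 0.2]
recon_losses = [1.0, 1.0, 8.0]
y_c = 0.9  # prediction on the original graph

weighted = oar_score(probs_c, recon_losses)
unweighted = sum(probs_c) / len(probs_c)
gap = y_c - weighted  # a small gap suggests a high-quality explanation
```

Down-weighting the off-distribution graph keeps its unreliable prediction from dominating the average.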

Section: A simplified version of OAR
To better generalize to large datasets and reduce the computational complexity, we provide a simplified version of OAR called SimOAR in this section. Compared with OAR, SimOAR achieves a significant improvement in computational efficiency at the expense of a small amount of performance degradation. Concretely, SimOAR is mainly motivated by three empirical inferences after executing OAR:
• The most time-consuming part of OAR is its preparatory work, i.e., training the OOD reweighting block, especially for large datasets. For example, on the MNIST superpixels dataset [38], which contains 70,000 graphs, training the VGAE to convergence accounts for 93.7% of OAR's execution time.
• In the course of generating G′, the number of perturbation operations is roughly proportional to the degree of distribution shift given by the OOD block. For example, a graph G′1 created by deleting one edge typically gets a higher OOD score than a graph G′2 created by deleting five edges.
• Two graphs generated with the same number of perturbations generally get similar reconstruction losses and are assigned similar OOD scores.
Based on these, to expedite computations and simplify OAR, we deactivate the OOD reweighting block (i.e., we delete all the components in the gray dotted boxes in Figure 2). As compensation for data distribution, we restrict the ratio of the number of perturbations to the number of edges in the original graph G to a pre-defined small value R. Since the generated graphs typically share similar reconstruction losses and OOD scores when R is fixed, we directly calculate their average prediction to approximate their expected prediction. The pseudocode and more implementation details of SimOAR are provided in Appendix A.
It is worth noting that, although a few generated graphs G′ in SimOAR may fall outside the distribution, the performance of SimOAR still significantly surpasses that of current evaluation methods w.r.t. consistency with both ground-truth-based metrics and human supervision. Hence, in light of SimOAR's efficiency, we strongly advocate its adoption as an alternative to the prevalent removal-based evaluation metrics. At the heart of SimOAR, and a central thesis of this paper, is the perspective that, during evaluation, rather than deleting all non-explanatory nodes and then gauging the resultant output variations, it is more insightful to randomly delete a portion of the non-explanatory nodes multiple times and then gauge the average output variations.
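The idea can be sketched as below. This is a simplified illustration rather than the procedure in Appendix A: it only deletes complementary edges (no additions, no node removal), `toy_prob_c` is a hypothetical stand-in for the GNN's class-c probability, and rounding `ratio * |E|` to obtain the per-sample budget is an assumption.

```python
import random
from typing import Callable, FrozenSet, Set, Tuple

Edge = Tuple[int, int]


def simoar_score(
    edges: Set[Edge],
    explanation_edges: Set[Edge],
    prob_c: Callable[[FrozenSet[Edge]], float],
    ratio: float,
    num_samples: int,
    rng: random.Random,
) -> float:
    """Unweighted average prediction under a small, fixed perturbation budget."""
    original = frozenset(edges)
    complement = sorted(original - frozenset(explanation_edges))
    # Budget: roughly ratio * |E| deletions, at least one, at most |complement|.
    budget = min(len(complement), max(1, round(ratio * len(original))))
    total = 0.0
    for _ in range(num_samples):
        removed = rng.sample(complement, budget)
        total += prob_c(original.difference(removed))
    return total / num_samples


def toy_prob_c(edge_set: FrozenSet[Edge]) -> float:
    """Hypothetical class-c probability that depends only on edge (0, 1)."""
    return 0.9 if (0, 1) in edge_set else 0.1


cycle = {(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)}
rng = random.Random(0)
score_good = simoar_score(cycle, {(0, 1)}, toy_prob_c, 0.2, 100, rng)
score_bad = simoar_score(cycle, {(2, 3)}, toy_prob_c, 0.2, 100, rng)
```

Because no OOD model is trained or queried, the cost per graph reduces to `num_samples` forward passes of the GNN.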

Section: Experiments
We present empirical results to demonstrate the effectiveness of our proposed methods OAR and SimOAR. The experiments aim to investigate the following research questions:
• RQ1: How is the evaluation quality of OAR and SimOAR compared to that of existing metrics?
• RQ2: How is the generalization of OAR and SimOAR compared to that of existing metrics?
• RQ3: What is the impact of the designs (e.g., the OOD reweighting block) on the evaluations? 

Table 1: Explainer-level evaluation results. Each cell reports the score with the explainer's rank in parentheses; the last column is the Kendall correlation τ with Recall.

| Dataset | Metric | SA | GradCAM | GNNExplainer | PGExplainer | CXPlain | ReFine | τ |
|---|---|---|---|---|---|---|---|---|
| BA3 | Recall | 88.12±0.00 (1) | 84.53±0.00 (2) | 76.83±4.64 (3) | 65.32±5.21 (5) | 54.77±4.42 (6) | 72.90±3.72 (4) | - |
| BA3 | RM | 35.39±0.00 (3) | 37.67±0.00 (4) | 43.32±1.97 (1) | 29.25±2.26 (5) | 27.78±0.96 (6) | 41.24±1.55 (2) | 0.73 |
| BA3 | DSE | 43.08±2.31 (1) | 41.38±1.75 (2) | 22.25±2.14 (6) | 37.92±2.52 (3) | 24.60±3.57 (5) | 29.31±2.96 (4) | 0.73 |
| BA3 | OAR | 93.12±4.60 (1) | 86.20±3.76 (2) | 80.19±1.68 (3) | 65.48±3.75 (5) | 59.69±2.98 (6) | 71.02±4.45 (4) | 1.00* |
| BA3 | SimOAR | 84.39±5.70 (1) | 83.44±4.81 (2) | 62.52±2.25 (3) | 50.02±2.87 (6) | 55.49±4.22 (5) | 60.42±3.32 (4) | 0.93 |
| TR3 | Recall | 82.08±0.00 (1) | 77.00±0.00 (2) | 60.09±4.97 (4) | 55.85±4.70 (5) | 44.39±5.57 (6) | 74.19±3.30 (3) | - |
| TR3 | RM | 55.08±0.00 (3) | 51.15±0.00 (4) | 79.08±4.31 (1) | 50.45±2.04 (5) | 47.72±3.82 (6) | 64.60±2.27 (2) | 0.67 |
| TR3 | DSE | 48.51±1.00 (1) | 37.32±2.35 (4) | 44.82±2.90 (2) | 33.71±3.72 (6) | 35.65±1.92 (5) | 39.49±3.31 (3) | 0.73 |
| TR3 | OAR | 95.23±3.75 (1) | 87.51±6.65 (2) | 72.63±5.89 (3) | 59.06±4.83 (5) | 51.61±3.14 (6) | 63.82±5.43 (4) | 0.93* |
| TR3 | SimOAR | 88.45±6.04 (1) | 76.37±4.54 (2) | 53.54±4.46 (6) | 68.58±5.80 (4) | 62.98±4.00 (5) | 75.59±4.27 (3) | 0.86 |
| MNIST-sp | Recall | 43.98±0.00 (3) | 44.39±0.00 (4) | 54.63±0.96 (1) | 30.13±1.42 (6) | 38.96±2.62 (5) | 47.88±1.60 (2) | - |
| MNIST-sp | RM | 21.34±0.00 (4) | 19.10±0.00 (5) | 22.23±1.02 (3) | 25.04±0.35 (2) | 27.15±0.69 (1) | 17.58±0.50 (6) | 0.33 |
| MNIST-sp | DSE | 30.37±4.06 (2) | 29.19±2.19 (3) | 14.03±1.77 (6) | 28.45±2.65 (1) | 22.95±2.44 (4) | 21.32±1.29 (5) | 0.20 |
| MNIST-sp | OAR | 66.28±2.46 (3) | 64.18±5.25 (4) | 82.22±4.13 (1) | 63.88±3.45 (5) | 51.37±1.76 (6) | 75.43±4.84 (2) | 0.93* |
| MNIST-sp | SimOAR | 54.72±3.84 (4) | 69.86±2.80 (3) | 79.69±3.53 (1) | 33.27±2.04 (6) | 52.93±3.26 (5) | 76.40±2.24 (2) | 0.93* |
| MUTAG | Recall | 41.12±0.00 (6) | 44.44±0.00 (5) | 55.95±4.94 (4) | 71.22±2.54 (2) | 64.65±1.50 (3) | 77.73±3.67 (1) | - |
| MUTAG | RM | 75.60±0.00 (5) | 77.39±0.00 (6) | 82.32±4.37 (2) | 87.11±2.59 (1) | 76.49±3.51 (4) | 81.76±2.64 (3) | 0.73 |
| MUTAG | DSE | 38.03±3.90 (4) | 32.36±2.40 (5) | 40.28±1.32 (3) | 49.18±2.57 (1) | 29.06±1.44 (6) | 43.19±2.14 (2) | 0.67 |
| MUTAG | OAR | 52.10±3.58 (6) | 53.19±3.87 (5) | 63.35±2.50 (4) | 88.46±2.43 (2) | 67.19±2.08 (3) | 92.81±5.24 (1) | 1.00* |
| MUTAG | SimOAR | 75.45±2.51 (5) | 71.07±5.40 (6) | 85.65±4.00 (2) | 81.48±2.98 (3) | 76.32±3.88 (4) | 89.40±5.06 (1) | 0.80 |

Section: Experimental settings
To evaluate the effectiveness of our method, we utilize four benchmark datasets: BA3 [39], TR3 [17], Mutagenicity [40,41], and MNIST-sp [38], which are publicly accessible and vary in domain and size. Moreover, to generate explanations for the graphs in these datasets, we adopt several state-of-the-art post-hoc explanation methods, i.e., SA [10], GradCAM [42], GNNExplainer [11], PGExplainer [43], CXPlain [44], and ReFine [39]. The prevailing metrics, removal-based evaluation (RM for short) and generation-based evaluation (i.e., DSE [17]), are selected as the baselines. Further experimental details can be found in Appendix B.

Section: Measurement metric
We elaborate on the measurement metric of existing evaluation methods in this part since how to fairly define the quality of an evaluation method is critical to our research.
Ground-truth explanations. We first follow the prior studies [11,45,43] and treat the subgraphs coherent to data generation procedure or human knowledge as ground truth. Although ground truth might not conform to the decision-making process exactly, it contains sufficient discriminative information to help justify the quality of explanations. Moreover, it's worth emphasizing again that our method does not depend on ground truth, and gathering ground truth is only for fair comparison.
Consistency with metric based on ground-truth explanations. Specifically, given a to-be-evaluated subgraph G s and its corresponding ground-truth explanation G GT s , we use Recall as the gold evaluation metric defined as
$$\mathrm{Recall}(G_s) = \frac{|\mathcal{E}_s \cap \mathcal{E}_s^{GT}|}{|\mathcal{E}_s^{GT}|},$$
where $\mathcal{E}_s$ and $\mathcal{E}_s^{GT}$ are the edge sets of $G_s$ and $G_s^{GT}$, and $|\cdot|$ denotes the cardinality of a set. Hence, for any evaluation method, we can calculate its consistency with Recall to quantify its performance via the Kendall correlation τ [46], defined as:
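$$\tau\left(\{r_i\}_{i=1}^{n}, \{s_i\}_{i=1}^{n}\right) = \frac{2}{n(n-1)} \sum_{i<j} \mathbb{I}\left(\mathrm{sgn}(r_i - r_j) = \mathrm{sgn}(s_i - s_j)\right), \tag{6}$$
where $\{r_i\}_{i=1}^{n}$ and $\{s_i\}_{i=1}^{n}$ are paired Recall values and evaluation scores; $\mathrm{sgn}(\cdot)$ is the sign function and $\mathbb{I}(\cdot)$ is the indicator function. The factor $2/(n(n-1))$ is the reciprocal of the number of pairs $i<j$, so $\tau = 1$ when the two sequences agree on every pair. The larger τ is, the more consistent the evaluation scores are with the Recall values, and thus the better the evaluation method is.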
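The two quantities above can be computed with a few lines of standard-library Python. In this sketch the function names are illustrative, and τ is normalized by the number of pairs, n(n-1)/2, which is an assumption chosen so that identical rankings give τ = 1. The example inputs are the BA3 Recall, OAR, and SimOAR scores reported in the results; the edge sets in the Recall example are made up.

```python
from itertools import combinations
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def recall(explanation_edges: Set[Edge], ground_truth_edges: Set[Edge]) -> float:
    """Fraction of ground-truth edges recovered by the explanation."""
    return len(explanation_edges & ground_truth_edges) / len(ground_truth_edges)


def sgn(x: float) -> int:
    return (x > 0) - (x < 0)


def kendall_agreement(r: List[float], s: List[float]) -> float:
    """Fraction of index pairs (i < j) on which r and s are ordered the same way."""
    pairs = list(combinations(range(len(r)), 2))
    agree = sum(1 for i, j in pairs if sgn(r[i] - r[j]) == sgn(s[i] - s[j]))
    return agree / len(pairs)


# Explainer-level scores on BA3, one entry per explainer.
recall_ba3 = [88.12, 84.53, 76.83, 65.32, 54.77, 72.90]
oar_ba3 = [93.12, 86.20, 80.19, 65.48, 59.69, 71.02]
simoar_ba3 = [84.39, 83.44, 62.52, 50.02, 55.49, 60.42]

tau_oar = kendall_agreement(recall_ba3, oar_ba3)
tau_simoar = kendall_agreement(recall_ba3, simoar_ba3)
example_recall = recall({(0, 1), (1, 2)}, {(1, 2), (2, 3)})
```

On these inputs OAR orders all 15 explainer pairs the same way as Recall, while SimOAR disagrees on exactly one pair, giving 14/15.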
Consistency with human intuition. The consistency between evaluation results and human intuition is also an important reference. In view of the high subjectivity of human intuition, we organized a large-scale user study engaging 100 volunteers. Results are shown in Appendix C. Limited by space, we further exhibit and discuss more detailed implementations and results in Appendix B, including but not limited to the generalization of OAR involving the evaluation of the post-hoc explanations for node classification task and the inherent explanations.

Section: Study of explanation evaluation (RQ1)
As a preparation for the experiments, we first collect the explanatory subgraphs $\{G_s^i\}_{i=1}^{N}$ for a set of graphs $\{G^i\}_{i=1}^{N}$ and the corresponding well-trained GNN model $f$. We denote the evaluation score on $G_s^i$ based on RM, DSE and our OAR and SimOAR by $s^i_{RM}$, $s^i_{DSE}$, $s^i_{OAR}$ and $s^i_{SimOAR}$, respectively. For more faithful comparison, we present both the explainer-level correlation defined as
$$\tau_* = \tau\left(\left\{\frac{1}{N}\sum_{i=1}^{N} \mathrm{Recall}\left(G_s^{i,h}\right)\right\}_{h \in \mathcal{H}}, \left\{\frac{1}{N}\sum_{i=1}^{N} s_*^{i,h}\right\}_{h \in \mathcal{H}}\right)$$
and the instance-level correlation defined as
$$\tau_* = \tau\left(\left\{\mathrm{Recall}\left(G_s^{i}\right)\right\}_{i=1}^{N}, \left\{s_*^{i}\right\}_{i=1}^{N}\right),$$
where $*$ can be RM, DSE, OAR, and SimOAR; $G_s^{i,h}$ means the subgraph $G_s^i$ is extracted by explainer $h$; $\mathcal{H}$ is the set of explainers. The explainer-level results, under all evaluation methods, on all datasets, are presented in Table 1. Moreover, considering the instance-level results share a similar tendency, we presented the representative results on BA3 and MNIST-sp in Figure 3 (a). According to Table 1 and Figure 3 (a) we can find that:
• Observation 1: OAR outperforms other methods in all cases. Substantially, Kendall rank correlation greatly improves after leveraging the paradigm of adversarial robustness. The most notable case is the explainers' rank on BA3 and MUTAG, where $\tau_* = 1.00$ achieves a tremendous increase from the RM and the DSE. It demonstrates the effectiveness and universality of OAR/SimOAR and verifies that OAR/SimOAR can be leveraged to boost the quality of evaluations.
• Observation 2: SimOAR performs a little worse than OAR, but still significantly improves over the strongest baselines. To be more specific, the average score of SimOAR is 7.83% less than that of OAR, but still 42.62% higher than RM and 37.45% higher than DSE. It demonstrates that SimOAR is adequate for the task of explanation evaluations in most cases.
Further analysis of the results presented in Table 1 reveals that:
• Observation 3: OAR/SimOAR presents a more fair and faithful comparison among explainers. The rankings provided by the OAR and SimOAR are highly consistent (i.e., $\tau_* = 0.928$ on average) with the references, while the removal- and generation-based rankings unsurprisingly pale by comparison. These empirical results give us the courage to leverage OAR and SimOAR to evaluate emerging explanation methods in the future.

Section: Study of generalization (RQ2)
Although the experiment provided in 3.3 is detailed and fair, we contend that the generalization of OAR and SimOAR is still unexamined. Specifically, a specific explainer always has a preference for certain patterns, and thus the explanatory subgraphs extracted by it often have similar structures. Therefore, the experimental results based on limited explainers may not generalize well to other existing or future explainers, especially those based on different lines of thought.
As it is impractical for us to cover all these explainers, we resort to directly generalizing the to-be-evaluated subgraphs. That is, we make use of fake explanatory subgraphs that are randomly sampled from the full graph. The detailed sampling algorithm and settings can be found in Appendix C. In these settings, the best case is an evaluation score that increases monotonically with the Recall level, which attains the best consistency. Average normalized scores under all evaluation methods are shown in Figure 3 (b), which indicate that:
• Observation 4: OAR/SimOAR greatly improves the consistency between evaluation scores and Recall, which indicates that our method has tremendous potential to perform well on other explainers. Conversely, the scores of removal- and generation-based methods are negatively correlated with importance on the set of to-be-evaluated subgraphs with high Recall.

Section: Study of designs (RQ3)
Figure 4: Case studies for OOD reweighting block with the graphs randomly selected from datasets MNIST-sp, MUTAG, and BA3 arranged from top to bottom. Best viewed in color.
Effectiveness of OOD block. We first focus on the effectiveness of the OOD block. The most immediate impact of the OOD block on OAR can be estimated by comparing the performance of OAR and SimOAR, which can also be regarded as an ablation experiment. We therefore turn to a qualitative analysis of the OOD block via the case studies shown in Figure 4, where all the OOD scores belonging to the same dataset are normalized to the range of 0 to 1. Based on the information conveyed in Figure 4, the following observation can be made:
• Observation 5: The OOD block assigns lower weights to subgraphs that violate the underlying properties of the full graph. For example, graph properties of chemical molecules, such as the valency rules, impose constraints on syntactically valid molecules. Hence, invalid molecular subgraphs, which destroy the integrity of the carbon ring by simply removing some bonds (edges) or atoms (nodes), are assigned low scores by the OOD block.
Time complexity. To further explore the efficiency of our evaluation method and its modules, we measure the running time of the evaluation process on every single graph and average it over the entire test set to obtain the per-graph time consumption. The comparison is provided in Table 2. According to Table 2 we can find that:
• Observation 6: SimOAR greatly reduces the execution time. Concretely, the mean per-graph time drops from 1.16-1.78 ms for OAR to 0.65-1.03 ms for SimOAR (Table 2), so the execution speed nearly doubles on most datasets. This improvement in efficiency matches the original intention behind SimOAR and supports its design.
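The speed-up can be read directly off Table 2 by dividing the mean per-graph time of OAR by that of SimOAR. The snippet below is a small worked calculation that uses only the mean values reported in Table 2 (standard deviations are ignored); the variable names are illustrative.

```python
# Mean per-graph time in ms, copied from Table 2.
oar_ms = {"BA3": 1.16, "TR3": 1.32, "MNIST-sp": 1.77, "MUTAG": 1.78}
simoar_ms = {"BA3": 0.84, "TR3": 0.65, "MNIST-sp": 1.03, "MUTAG": 0.96}

# Speed-up of SimOAR over OAR on each dataset.
speedup = {name: oar_ms[name] / simoar_ms[name] for name in oar_ms}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

Under these means, the ratio ranges from roughly 1.4x on BA3 to roughly 2.0x on TR3.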

Section: Related Work & Further Discussion
OAR & Contemporary Evaluation Metrics. Apart from the qualitative evaluation methods based on human intuition, recent literature groups quantitative metrics into four primary categories: accuracy, faithfulness, stability, and fairness [47,48,49,50,51,52]. Notably, the Precision and Recall metrics align with the accuracy category, while our proposed OAR and SimOAR fall under the faithfulness category. Among these methods, the most recently proposed faithfulness metric is GEF [49], which, however, does not quantify the distribution shift in subgraphs. We nevertheless provide an experimental comparison between GEF and our metrics on the latest dataset SHAPEGGEN [49] in Appendix B.

Section: OOD & GNNs Explainability
The OOD issue is currently one of the most critical challenges in the post-hoc explainability domain [16,53]. To sidestep this challenge, many studies have pivoted towards the development of inherently explainable GNNs [15,54]. Notwithstanding the complexity of the task, efforts such as FIDO [55] are making significant progress in addressing the OOD problem within post-hoc explanations. Concurrently, CGE [56] leverages the lottery ticket hypothesis [57,58,59,60,61] to craft cooperative explanations for both GNNs and graphs, wherein the OOD challenge is potentially mitigated through the EM algorithm. GIBE [62] delves into the intersection of the OOD issue and regularization, viewing it through the lens of information theory. Furthermore, MixupExplainer [63] and CoGE [64] approach the OOD problem from the generation and recognition perspectives, respectively.
Simultaneously, the evaluation of inherent explanations encounters the same hurdle as that of post-hoc explanations: it is challenging to quantify in the absence of ground truth. Fortunately, by introducing an additional well-trained GNN, OAR can be employed to evaluate inherent explanations in the same way as post-hoc explanations. The experimental results are shown in Appendix B.
Limitations & Concerns. While we acknowledge the effectiveness of our methods, we also recognize their limitations. Concretely, despite utilizing SimOAR to expedite the evaluation process, our paradigm remains more time-intensive than the conventional removal-based metric. To overcome this constraint, a possible solution is to determine the optimal number of perturbations and to implement a self-adaptive extraction module that selects the perturbed features.
Furthermore, we recognize potential apprehensions regarding the migration of trust issues from the black-box GNN to the equally non-transparent VGAE. Nevertheless, we posit that the repercussions of this are substantially mitigated in our streamlined method. Specifically, SimOAR bypasses VGAE in favor of employing transparent heuristics for perturbation generation, effectively addressing the aforementioned trust concerns. It's noteworthy that, while SimOAR's performance may be marginally below or comparable to OAR's, it consistently exceeds other benchmarks. This emphasizes VGAE's restrained impact and reaffirms our recommendation of SimOAR over OAR.

Section: Conclusion
In this paper, we explored the evaluation process of GNN explanations and proposed a novel evaluation metric, OOD-resistant adversarial robustness (OAR). OAR draws inspiration from the notion of adversarial robustness and evaluates the quality of explanations by calculating their robustness under attack. It addresses the inherent limitations of current removal- and generation-based evaluation metrics by taking both data distribution and GNN behavior into account. For applications involving large datasets, we introduce a simplified version of OAR (SimOAR), which achieves a significant increase in computational efficiency at the cost of a small amount of performance degradation. This work represents an initial attempt to exploit evaluation metrics for post-hoc GNN explainability from the perspective of adversarial robustness and resistance to OOD.

Section: Ethics Statement
This work is primarily foundational in GNN explainability, focusing on the development of a more reliable evaluation algorithm. Its primary aim is to contribute to the academic community by enhancing the understanding and implementation of the evaluation process. We do not foresee any direct, immediate, or negative societal impacts stemming from the outcomes of our research.

Section: A Algorithms
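Algorithm 1 presents the pseudocode of the evaluation process of our proposed method OAR. The pseudocode of SimOAR can be obtained by removing lines 1, 6, and 8 and modifying line 9 into "$s \leftarrow \frac{1}{N_{adv}} \sum_i y^{(i)}$". For clarity, we put the steps of feeding the adversarial graph $G'^{(i)}$ into the target GNN and the VGAE inside the for-loop. In an actual implementation, however, all $N_{adv}$ adversarial graphs can be batched and fed at one time after the sampling process is finished, to expedite computation. Meanwhile, Algorithm 2 presents the sampling process of fake explanatory subgraphs for general evaluation.

Algorithm 1: Evaluation Process of OAR
Input: Trained GNN $f$; to-be-evaluated subgraph $G_s$ and its corresponding original graph $G$ and dataset $D$; perturbation ratio $R$; number of adversarial graphs $N_{adv}$
Output: Evaluation score $s$
1: Train a standard VGAE on $D$ according to [19].
2: $c \leftarrow \arg\max_i f(G)_i$.
3: for $i = 1, 2, \ldots, N_{adv}$ do
4:   $G'^{(i)} \leftarrow$ randomly deleting $\lfloor R \cdot |E_G| \rceil$ edges from $G$ while fixing $G_s$.
5:   $y^{(i)} \leftarrow f(G'^{(i)})_c$.
6:   $\mathcal{L}^{(i)}_{recon} \leftarrow$ calculated according to Equation 5 using the trained VGAE.
7: end for
8: $w^{(i)}_{OOD} \leftarrow \frac{1/\mathcal{L}^{(i)}_{recon}}{\sum_j 1/\mathcal{L}^{(j)}_{recon}}$, $i = 1, 2, \ldots, N_{adv}$.
9: $s \leftarrow \sum_i w^{(i)}_{OOD} \cdot y^{(i)}$.

Algorithm 2: Sampling Fake Explanatory Subgraphs for General Evaluation
Input: Dataset $D = \{G_1, G_2, \ldots, G_N\}$; number of Recall levels $N_L$; number of sampled subgraphs per graph $N_{sub}$; size of sampled subgraph $K_{sub}$
Output: Pairs of Recall level and corresponding subgraphs $(L_k, \{G^{i,j}_{s,k} \mid i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, N_{sub}\})$, $k = 1, 2, \ldots, N_L$
1: for $k = 1, 2, \ldots, N_L$ do
2:   $L_k \leftarrow \frac{k-1}{N_L-1}$
3:   for $i = 1, 2, \ldots, N$ do
4:     $K_{GT} \leftarrow$ the number of edges in the ground-truth explanation of $G_i$
5:     $K_{pos} \leftarrow \lfloor L_k \times K_{GT} \rceil$
6:     $K_{neg} \leftarrow K_{sub} - K_{pos}$
7:     for $j = 1, 2, \ldots, N_{sub}$ do
8:       $G^{i,j}_{s,k} \leftarrow$ a connected subgraph, randomly sampled from $G_i$, with $K_{pos}$ edges in the ground-truth explanation and $K_{neg}$ edges not in it.
9:     end for
10:   end for
11: end for

Overall, for each dataset, a target GNN classification model is first well trained. Then the explainers are built on the GNN and generate explanations for its predictions on the dataset. After that, the explanation evaluation methods assess how good the explanations are. Our work stands at the last level.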
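For illustration, the edge-budget bookkeeping of Algorithm 2 (the split of a sampled subgraph into ground-truth and non-ground-truth edges) can be sketched as follows. This is a simplified sketch rather than the authors' implementation: it works on plain edge lists and drops the connectivity constraint on the sampled subgraph, and the function name `sample_fake_explanation` and the toy graph are hypothetical. It also assumes that the Recall levels are evenly spaced, `L_k = (k - 1) / (N_L - 1)`, and that ⌊·⌉ denotes rounding to the nearest integer.

```python
import math
import random


def round_half_up(x):
    """Round to the nearest integer, with ties going up."""
    return math.floor(x + 0.5)


def sample_fake_explanation(gt_edges, other_edges, recall_level, k_sub, rng):
    """Sample k_sub edges, K_pos of them from the ground-truth explanation.

    Unlike Algorithm 2, this sketch does not enforce connectivity.
    """
    k_pos = round_half_up(recall_level * len(gt_edges))
    k_neg = k_sub - k_pos
    if k_neg < 0 or k_neg > len(other_edges):
        raise ValueError("k_sub is incompatible with the requested Recall level")
    return rng.sample(gt_edges, k_pos) + rng.sample(other_edges, k_neg)


# Toy graph: 4 ground-truth edges and 6 other edges (hypothetical).
gt = [(0, 1), (1, 2), (2, 3), (3, 0)]
rest = [(3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]
n_levels = 5
rng = random.Random(0)

for k in range(1, n_levels + 1):
    level = (k - 1) / (n_levels - 1)
    sub = sample_fake_explanation(gt, rest, level, k_sub=5, rng=rng)
    recall = len(set(sub) & set(gt)) / len(gt)
    print(level, recall)
```

Because the number of ground-truth edges is fixed by the Recall level, every subgraph sampled at a given level has the same Recall by construction, while its non-ground-truth edges vary at random.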
Figure 5: Study cases. For each row, the explanations are ranked based on average rankings given by volunteers. Highlighted evaluation method below each explanation means that the method has given the explanation the highest score compared to other explanations in that row. Best viewed in color.
Target GNNs. GNNs have garnered significant recognition for their prowess in encoding graph data [37,66,67,34,58]. Amidst the vast landscape of GNNs, GIN [67] distinguishes itself with its superior encoding aptitude. In light of this, the target GNNs for BA3, TR3, and Mutagenicity share the same structure, which is a two-layered GIN followed by a two-layered MLP with 32 hidden channels. They are trained with the maximum number of epochs equal to 20, 200, and 200, respectively, batch size equal to 128, and learning rate equal to 0.001. The target GNN for MNIST-sp is adapted from an example code provided by PyG and is trained with the number of epochs equal to 20, batch size equal to 64, and initial learning rate equal to 0.01. Before training, we randomly split BA3, TR3, and Mutagenicity into train and test sets with ratios of 90% and 10%, respectively, while adopting the split provided by PyG for MNIST-sp. During training, we reserve data of the same size as the test set from the train set as the validation set and save the model that reaches the highest classification accuracy on the validation set for later use.
Explainers. We have implemented six state-of-the-art post-hoc explainers, namely, SA, GradCAM, GNNExplainer, PGExplainer, CXPlain, and ReFine, as stated in Section 3.1, to generate explanatory subgraphs. Here we give a brief introduction to them:
• SA [10] captures the gradients w.r.t. adjacency matrix of the input features in the process of backpropagation and directly treats them as the importance scores.
• GradCAM [42] takes one step further over SA via improving the gradients w.r.t. the input features like edges by using context within the graph convolutional layers.
• GNNExplainer [11] directly learns an adjacency matrix mask by maximizing the mutual information between a GNN's prediction and distribution of possible subgraph structures.
• PGExplainer [43] adopts a deep neural network to parameterize the generation process of explanations, which makes it a natural approach to explaining multiple instances collectively.
• CXPlain [44] treats the explanations as a causal learning task and trains causal explanation models that learn to estimate to what degree certain inputs cause outputs in the to-be-explained model.
• ReFine [39] leverages the pre-training explanations to exhibit global explanations and the fine-tuning explanations to adapt the global explanations in the local context.
The default hyper-parameters suggested by those papers are adopted, as whether the explainers are at their optimal state is secondary to our work. Among these explainers, SA, GradCAM, and GNNExplainer directly take the to-be-explained graph as input, while the remaining three need to be trained in advance on a set of graphs, i.e., the train set in our case. We only use the explanations extracted from graphs in the test set for evaluation.
Evaluation Methods. Finally, we arrive at the explanation evaluation level. There are four evaluation methods to be considered, i.e., removal-based evaluation, DSE, OAR, and SimOAR. Removal-based evaluation directly feeds the explanatory subgraph into the target GNN and takes its prediction on the predicted class of the original graph as the evaluation score, which requires no further implementation details. For DSE, we use its public source code and follow its paper to set hyper-parameters. As for our method OAR/SimOAR, we have summarized the entire process of OAR in Algorithm 1. Here we present more details on how the VGAE is trained. The encoder involves two two-layered GCNs for obtaining µ and σ, each of which is realized with hidden channels equal to 256 and output channels equal to 128, while the two GCNs share the first layer. The dataset split process is the same as in the training of the target GNNs. We train the VGAE model on the train set with the number of epochs equal to 100, batch size equal to 256, and learning rate equal to 0.001. The model that reaches the lowest loss on the validation set is saved for later OOD reweighting.
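Once the target GNN and the VGAE have been applied to the adversarial graphs, the final aggregation in Algorithm 1 reduces to a weighted average: each prediction is weighted in proportion to the inverse of its reconstruction loss, whereas SimOAR uses a plain average. The following is a minimal sketch of that step only, assuming strictly positive reconstruction losses; the function names `oar_score` and `simoar_score` and the numeric values are hypothetical.

```python
def oar_score(class_probs, recon_losses):
    """Weighted average of predictions, with weights proportional to 1 / L_recon."""
    if len(class_probs) != len(recon_losses):
        raise ValueError("one reconstruction loss is needed per adversarial graph")
    inverse = [1.0 / loss for loss in recon_losses]
    total = sum(inverse)
    weights = [v / total for v in inverse]  # the weights sum to 1
    return sum(w * y for w, y in zip(weights, class_probs))


def simoar_score(class_probs):
    """Plain average of predictions (no OOD reweighting)."""
    return sum(class_probs) / len(class_probs)


# Hypothetical values for three adversarial graphs:
# y[i] is the GNN probability on class c, l_recon[i] the VGAE reconstruction loss.
y = [0.9, 0.8, 0.1]
l_recon = [1.0, 1.0, 4.0]  # the third graph is reconstructed poorly

print(oar_score(y, l_recon))
print(simoar_score(y))
```

In this toy case the poorly reconstructed third graph receives a weight of 1/9, so its low prediction pulls the OAR score down far less than it pulls down the unweighted SimOAR average.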

Section: B.2 More Quantitative Results
Here, we sequentially present the experimental results of 1) the comparison between GEF [49] and our metrics on the latest dataset SHAPEGGEN and the other datasets (Table 3), 2) the correlation between metrics and Recall across explanatory subgraphs in four node classification datasets (Table 4), and 3) the correlation between metrics and Recall when evaluating the explanations generated by the inherently explainable GNN, GSAT [15], by introducing an additional well-trained GIN [67] as the model f in Algorithm 1 (Table 5).
Note that when evaluating explanations in node classification tasks, we construct an ego graph for each node in the input graph based on the number of layers in the baseline GNN. In this way, the explanation task for node classification is converted into an explanation task for graph classification. The remaining hyperparameters and methods in our OAR/SimOAR are unchanged.
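The ego-graph construction can be sketched with a breadth-first search. This is a minimal illustration, not the authors' code: it assumes that the ego graph of a node contains all nodes within as many hops as the GNN has layers, and the function names and the toy graph are hypothetical.

```python
from collections import deque


def ego_graph_nodes(adjacency, center, num_hops):
    """Return the set of nodes within `num_hops` hops of `center` (BFS)."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == num_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


def induced_edges(adjacency, nodes):
    """Undirected edges (u < v) of the subgraph induced by `nodes`."""
    return {(u, v) for u in nodes for v in adjacency.get(u, ()) if v in nodes and u < v}


# Toy path graph 0-1-2-3-4 (hypothetical); a two-layer GNN is assumed to see 2 hops.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nodes = ego_graph_nodes(adj, center=0, num_hops=2)
print(sorted(nodes), sorted(induced_edges(adj, nodes)))
```

Each resulting ego graph can then be treated as one input graph, so the graph-level evaluation procedure applies without modification.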

Section: C User Study
In order to measure the consistency between evaluation results and human intuition, we organized a large-scale user study engaging 100 volunteers. Each volunteer was asked to check 5 groups of graphs, each of which contains an instance from MNIST-sp, its predicted class, and 5 randomly sampled subgraphs from this instance, and to rank these 5 subgraphs by how well they serve as explanations of the prediction, based on intuition. We show partial results in Figure 5. According to these results, our evaluation methods, i.e., OAR and SimOAR, show the highest consistency with human intuition.

Section: Acknowledgments
This work was supported by the National Natural Science Foundation of China (9227010114, 62121002) and the University Synergy Innovation Program of Anhui Province (GXXT-2022-040).


References:
[b0] Vijay Prakash Dwivedi; Chaitanya K Joshi; Thomas Laurent; Yoshua Bengio; Xavier Bresson (2020). Benchmarking graph neural networks. 
[b1] Tianyi Zhao; Yang Hu; Linda R Valsdottir; Tianyi Zang; Jiajie Peng (2021). Identifying drug-target interactions based on graph convolutional network and deep neural network. Briefings Bioinform
[b2] Zhiwei Guo; Heng Wang (2021). A deep graph neural network-based mechanism for social recommendations. IEEE Trans. Ind. Informatics
[b3] Kun Wang; Yuxuan Liang; Pengkun Wang; Xu Wang; Pengfei Gu; Junfeng Fang; Yang Wang (2023). Searching lottery tickets in graph neural networks: A dual perspective. 
[b4] Minh N Vu; My T Thai (2020). Pgm-explainer: Probabilistic graphical model explanations for graph neural networks. 
[b5] Wanyu Lin; Hao Lan; Baochun Li (2021). Generative causal explanations for graph neural networks. 
[b6] Michael Sejr Schlichtkrull; Nicola De Cao; Ivan Titov (2021). Interpreting graph neural networks for NLP with differentiable edge masking. 
[b7] Hao Yuan; Haiyang Yu; Shurui Gui; Shuiwang Ji (2020). Explainability in graph neural networks: A taxonomic survey. 
[b8] Hao Yuan; Haiyang Yu; Jie Wang; Kang Li; Shuiwang Ji (2021). On explainability of graph neural networks via subgraph explorations. PMLR
[b9] Federico Baldassarre; Hossein Azizpour (2019). Explainability techniques for graph convolutional networks. 
[b10] Zhitao Ying; Dylan Bourgeois; Jiaxuan You; Marinka Zitnik; Jure Leskovec (2019). Gnnexplainer: Generating explanations for graph neural networks. 
[b11] Leo Breiman (2001). Random forests. Mach. Learn
[b12] Giles Hooker; Lucas Mentch (2019). Please stop permuting features: An explanation and alternatives. 
[b13] Haoyang Li; Xin Wang; Ziwei Zhang; Wenwu Zhu (2022). Out-of-distribution generalization on graphs: A survey. 
[b14] Siqi Miao; Mia Liu; Pan Li (2022). Interpretable and generalizable graph learning via stochastic attention mechanism. 
[b15] Ying-Xin Wu; Xiang Wang; An Zhang; Xiangnan He; Tat-Seng Chua (2022). Discovering invariant rationales for graph neural networks. 
[b16] Ying-Xin Wu; Xiang Wang; An Zhang; Xia Hu; Fuli Feng; Xiangnan He; Tat-Seng Chua (2022). Deconfounding to explanation evaluation in graph neural networks. 
[b17] Christopher Frye; Damien De Mijolla; Tom Begley; Laurence Cowton; Megan Stanley; Ilya Feige (2021). Shapley explainability on the data manifold. 
[b18] Thomas N Kipf; Max Welling (2016). Variational graph auto-encoders. 
[b19] Diederik P Kingma; Max Welling (2014). Auto-encoding variational bayes. 
[b20] Thomas Fel; Melanie Ducoffe; David Vigouroux; Rémi Cadène; Mikael Capelle; Claire Nicodeme; Thomas Serre (2022). Don't lie to me! robust and efficient explainability with verified perturbation analysis. 
[b21] Cheng-Yu Hsieh; Chih-Kuan Yeh; Xuanqing Liu; Pradeep Kumar Ravikumar; Seungyeon Kim; Sanjiv Kumar; Cho-Jui Hsieh (2021). Evaluations and methods for explanation through robustness analysis. 
[b22] Yongduo Sui; Tianlong Chen; Pengfei Xia; Shuyao Wang; Bin Li (2022). Towards robust detection and segmentation using vertical and horizontal adversarial training. IEEE
[b23] Daniel Zügner; Amir Akbarnejad; Stephan Günnemann (2018). Adversarial attacks on neural networks for graph data. 
[b24] Daniel Zügner; Stephan Günnemann (2019). Adversarial attacks on graph neural networks via meta learning. 
[b25] Jure Leskovec; Jon M Kleinberg; Christos Faloutsos (2005). Graphs over time: densification laws, shrinking diameters and possible explanations. 
[b26] Beatrice Bevilacqua; Yangze Zhou; Bruno Ribeiro (2021-07). Size-invariant graph representations for graph classification extrapolations. 
[b27] Qi Liu; Miltiadis Allamanis; Marc Brockschmidt; Alexander L Gaunt (2018-12-03). Constrained graph variational autoencoders for molecule design. 
[b28] Haoyang Li; Xin Wang; Ziwei Zhang; Wenwu Zhu (2021). OOD-GNN: out-of-distribution generalized graph neural network. 
[b29] Xiaoxiao Ma; Jia Wu; Shan Xue; Jian Yang; Quan Z Sheng; Hui Xiong (2021). A comprehensive survey on graph anomaly detection with deep learning. 
[b30] Yuan Gao; Xiang Wang; Xiangnan He; Zhenguang Liu; Huamin Feng; Yongdong Zhang (2023). Alleviating structural distribution shift in graph anomaly detection. ACM
[b31] Xian Teng; Muheng Yan; Ali Mert Ertugrul; Yu-Ru Lin (2018). Deep into hypersphere: Robust and unsupervised anomaly discovery in dynamic networks. 
[b32] Yuan Gao; Xiang Wang; Xiangnan He; Zhenguang Liu; Huamin Feng; Yongdong Zhang (2023). Addressing heterophily in graph anomaly detection: A perspective of graph spectrum. ACM
[b33] Yuan Gao; Xiang Wang; Xiangnan He; Huamin Feng; Yong-Dong Zhang (2023). Rumor detection with self-supervised learning on texts and social graph. Frontiers Comput. Sci
[b34] Jonathan Ho; Ajay Jain; Pieter Abbeel (2020). Denoising diffusion probabilistic models. 
[b35] Prafulla Dhariwal; Alexander Quinn Nichol (2021). Diffusion models beat gans on image synthesis. 
[b36] Thomas N Kipf; Max Welling (2017). Semi-supervised classification with graph convolutional networks. 
[b37] Federico Monti; Davide Boscaini; Jonathan Masci; Emanuele Rodolà; Jan Svoboda; Michael M Bronstein (2017). Geometric deep learning on graphs and manifolds using mixture model cnns. 
[b38] Xiang Wang; Ying-Xin Wu; An Zhang; Xiangnan He; Tat-Seng Chua (2021). Towards multigrained explainability for graph neural networks. 
[b39] Jeroen Kazius; Ross Mcguire; Roberta Bursi (2005). Derivation and validation of toxicophores for mutagenicity prediction. Journal of medicinal chemistry
[b40] Kaspar Riesen; Horst Bunke (2008). IAM graph database repository for graph based pattern recognition and machine learning. 
[b41] Ramprasaath R Selvaraju; Michael Cogswell; Abhishek Das; Ramakrishna Vedantam; Devi Parikh; Dhruv Batra (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. 
[b42] Dongsheng Luo; Wei Cheng; Dongkuan Xu; Wenchao Yu; Bo Zong; Haifeng Chen; Xiang Zhang (2020). Parameterized explainer for graph neural network. 
[b43] Patrick Schwab; Walter Karlen (2019). Cxplain: Causal explanations for model interpretation under uncertainty. Curran Associates, Inc.
[b45] Hao Yuan; Jiliang Tang; Xia Hu; Shuiwang Ji (2020). XGNN: towards model-level explanations of graph neural networks. 
[b46] Eulalia Szmidt; Janusz Kacprzyk (2011). The spearman and kendall rank correlation coefficients between intuitionistic fuzzy sets. Atlantis Press
[b47] Anna Himmelhuber; Mitchell Joblin; Martin Ringsquandl; Thomas A Runkler (2021). Demystifying graph neural network explanations. 
[b48] Benjamín Sánchez-Lengeling; Jennifer N Wei; Brian K Lee; Emily Reif; Peter Wang; Wesley Wei Qian; Kevin Mccloskey; Lucy J Colwell; Alexander B Wiltschko (2020). Evaluating attribution for graph neural networks. 
[b49] Chirag Agarwal; Owen Queen; Himabindu Lakkaraju; Marinka Zitnik (2022). Evaluating explainability for graph neural networks. 
[b50] Chirag Agarwal; Marinka Zitnik; Himabindu Lakkaraju (2022). Probing GNN explainers: A rigorous theoretical and empirical analysis of GNN explanation methods. AISTATS
[b51] Lukas Faber; Amin K Moghaddam; Roger Wattenhofer (2021). When comparing to ground truth is wrong: On evaluating GNN explanation methods. 
[b52] He Zhang; Bang Wu; Xingliang Yuan; Shirui Pan; Hanghang Tong; Jian Pei (2022). Trustworthy graph neural networks: Aspects, methods and trends. 
[b53] Yongduo Sui; Xiang Wang; Tianlong Chen; Meng Wang; Xiangnan He; Tat-Seng Chua (2023). Inductive lottery ticket learning for graph neural networks. Journal of Computer Science and Technology
[b54] Gang Liu; Tong Zhao; Jiaxin Xu; Tengfei Luo; Meng Jiang (2022). Graph rationalization with environment-based augmentations. 
[b55] Chun-Hao Chang; Elliot Creager; Anna Goldenberg; David Duvenaud (2019). Explaining image classifiers by counterfactual generation. 
[b56] Junfeng Fang; Xiang Wang; An Zhang; Zemin Liu; Xiangnan He; Tat-Seng Chua (2023). Cooperative explanations of graph neural networks. ACM
[b57] Ari S Morcos; Haonan Yu; Michela Paganini; Yuandong Tian (2019). One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. 
[b58] Kun Wang; Yuxuan Liang; Pengkun Wang; Xu Wang; Pengfei Gu; Junfeng Fang; Yang Wang (2022). Searching lottery tickets in graph neural networks: A dual perspective. 
[b59] Yongduo Sui; Xiang Wang; Jiancan Wu; Min Lin; Xiangnan He; Tat-Seng Chua (2022). Causal attention for interpretable and generalizable graph classification. 
[b60] Tianlong Chen; Yongduo Sui; Xuxi Chen; Aston Zhang; Zhangyang Wang (2021). A unified lottery ticket hypothesis for graph neural networks. 
[b61] Yanfang Wang; Yongduo Sui; Xiang Wang; Zhenguang Liu; Xiangnan He (2022). Exploring lottery ticket hypothesis in media recommender systems. International Journal of Intelligent Systems
[b62] Junfeng Fang; Wei Liu; An Zhang; Xiang Wang; Xiangnan He; Kun Wang; Tat-Seng Chua (2022). On regularization for explaining graph neural networks: An information theory perspective. 
[b63] Jiaxing Zhang; Dongsheng Luo; Hua Wei (2023). Mixupexplainer: Generalizing explanations for graph neural networks with data augmentation. ACM
[b64] Lukas Faber; Amin K Moghaddam; Roger Wattenhofer (2020). Contrastive graph neural network explanation. 
[b65] Matthias Fey; Jan E Lenssen (2019). Fast graph representation learning with PyTorch Geometric. 
[b66] Petar Velickovic; Guillem Cucurull; Arantxa Casanova; Adriana Romero; Pietro Liò; Yoshua Bengio (2017). Graph attention networks. 
[b67] Keyulu Xu; Weihua Hu; Jure Leskovec; Stefanie Jegelka (2019). How powerful are graph neural networks?. 

Figures:
Figure fig_0: 2
Type: figure
Caption: Figure 2: The pipeline of OAR, which takes both model behavior and data distribution into account.
Data: 

Figure fig_1: 3
Type: figure
Caption: Figure 3: The performance of various evaluation metrics. (a) Correlation between metrics and Recall across various backbone explainers, where the vertical axis represents the normalized Kendall rank correlation. (b) Consistency between Recall and the scores provided by metrics. The closer the curve is to monotonically increasing, the better the evaluation metric. Best viewed in color.
Data: 

Figure fig_2: 1
Type: figure
Caption: Algorithm 1: Evaluation Process of OAR. Input: Trained GNN $f$; to-be-evaluated subgraph $G_s$ and its corresponding original graph $G$ and dataset $D$; perturbation ratio $R$; number of adversarial graphs $N_{adv}$. Output: Evaluation score $s$. 1: Train a standard VGAE on $D$ according to [19]. 2: $c \leftarrow \arg\max_i f(G)_i$. 3: for $i = 1, 2, \ldots, N_{adv}$ do
Data: 

Figure fig_3: 213
Type: figure
Caption: Algorithm 2: Sampling Fake Explanatory Subgraphs for General Evaluation. Input: Dataset $D = \{G_1, G_2, \ldots, G_N\}$; number of Recall levels $N_L$; number of sampled subgraphs per graph $N_{sub}$; size of sampled subgraph $K_{sub}$. Output: Pairs of Recall level and corresponding subgraphs $(L_k, \{G^{i,j}_{s,k} \mid i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, N_{sub}\})$, $k = 1, 2, \ldots, N_L$. 1: for $k = 1, 2, \ldots, N_L$ do 2: $L_k \leftarrow \frac{k-1}{N_L-1}$ 3: for $i = 1, 2, \ldots, N$ do
Data: 

Figure fig_4: 
Type: figure
Caption: All experiments are conducted on a Linux machine with 8 NVIDIA GeForce RTX 3090 (24 GB) GPUs. The CUDA version is 11.6 and the driver version is 510.39.01. All code is written in Python 3.9.13 with PyTorch 1.13.0 and PyTorch Geometric (PyG) [65] 2.2.0. We adopt the Adam optimizer throughout all experiments.
Data: 

Figure tab_0: 1
Type: table
Caption: Overall evaluation scores and rankings of explainers under different evaluation methods. Symbol (•) indicates the rank of explainers. Our methods, i.e., OAR and SimOAR are bold and the best-performing methods are denoted with the superscript asterisk.
Data: SA | GradCAM | GNNExplainer | PGExplainer | CXPlain | ReFine | τ ↑

Figure tab_1: 2
Type: table
Caption: Per-graph time consumption (ms).
Data:
Method | BA3 | TR3 | MNIST-sp | MUTAG
RM | 0.11±0.02 | 0.10±0.01 | 0.15±0.03 | 0.13±0.01
DSE | 1.73±0.27 | 1.60±0.14 | 2.12±0.39 | 1.88±0.25
OAR | 1.16±0.05 | 1.32±0.06 | 1.77±0.10 | 1.78±0.03
SimOAR | 0.84±0.07 | 0.65±0.08 | 1.03±0.07 | 0.96±0.06

Figure tab_2: 3
Type: table
Caption: Average consistency scores of different evaluation metrics (corresponding to Table 1 in the main paper).
Data:
Metric | SHAPEGGEN | TR3 | MUTAG | MNIST | BA3
GEF | 0.800 | 0.734 | 0.800 | 0.934 | 0.734
SimOAR | 0.800 | 0.867 | 0.800 | 0.934 | 0.934
OAR | 0.867 | 0.934 | 1.000 | 0.934 | 1.000

Figure tab_3: 4
Type: table
Caption: Correlation between metrics and Recall across explanatory subgraphs in four node classification datasets in [11] (corresponding to Figure 3 (a) in the main paper).
Data:
Metric | BA-Shapes | BA-Community | Tree-Cycles | Tree-Grid
RM | 0.312 | 0.321 | 0.295 | 0.411
DSE | 0.406 | 0.377 | 0.343 | 0.384
OAR | 0.542 | 0.560 | 0.489 | 0.440
SimOAR | 0.527 | 0.535 | 0.471 | 0.431

Figure tab_4: 5
Type: table
Caption: Correlation between metrics and Recall when evaluating the explanations generated by the inherently explainable GNN, GSAT (corresponding to Figure 3 (a) in the main paper).
Data:
Metric | BA3 | MUTAG | TR3 | MNIST
RM | 0.376 | 0.381 | 0.402 | 0.337
DSE | 0.411 | 0.425 | 0.417 | 0.319
OAR | 0.613 | 0.590 | 0.585 | 0.606
SimOAR | 0.598 | 0.572 | 0.552 | 0.581


Formulas:
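Formula formula_0: $\delta_{G_s} = \min_{A'} \sum_{u \in V} \sum_{v \in V \setminus u} |A_{uv} - A'_{uv}| \quad \text{s.t.} \quad \arg\max_i f(G')_i \neq \arg\max_i y_i, \quad \sum_{u \in V_s} \sum_{v \in V_s \setminus u} |A_{uv} - A'_{uv}| = 0$, (1)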

Formula formula_1: $\delta^{*}_{G_s} = \mathbb{E}_{G'} \left( f(G')_c - y_c \right) \quad \text{s.t.} \quad c = \arg\max_i y_i, \quad \sum_{u \in V_s} \sum_{v \in V_s \setminus u} |A_{uv} - A'_{uv}| = 0$, (2)

Formula formula_2: $q(Z \mid A', X') = \prod_{i=1}^{|V'|} q(z_i \mid A', X') = \prod_{i=1}^{|V'|} \mathcal{N}(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2))$, (3)

Formula formula_3: $p(A' \mid Z) = \prod_{i=1}^{|V'|} \prod_{j=1}^{|V'|} p(A'_{ij} \mid z_i, z_j)$, with $p(A'_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^{\top} z_j)$, (4)

Formula formula_4: $\mathcal{L}_{recon}(G') = -\log p(A' \mid Z)$, with $Z = \mu = \mathrm{GCN}_{\mu}(A', X')$. (5)
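To make Equations 4 and 5 concrete, the reconstruction loss of an inner-product decoder can be computed directly from the node embeddings. The following is a minimal pure-Python sketch under stated assumptions: the embeddings `z` are given by hand here instead of being produced by a trained GCN encoder, the product in Equation 4 is taken over all ordered pairs (i, j) including i = j, no numerical safeguards are applied, and the function names and toy values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reconstruction_loss(adjacency, z):
    """-log p(A' | Z) for an inner-product decoder, summed over all (i, j)."""
    n = len(adjacency)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            logit = sum(a * b for a, b in zip(z[i], z[j]))
            p_edge = sigmoid(logit)  # p(A'_ij = 1 | z_i, z_j)
            p = p_edge if adjacency[i][j] == 1 else 1.0 - p_edge
            loss -= math.log(p)
    return loss


# Hypothetical embeddings: nodes 0 and 1 are similar, node 2 points the other way.
z = [[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
a_plausible = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]  # edge between the similar nodes
a_implausible = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]  # edge between dissimilar nodes

print(reconstruction_loss(a_plausible, z))
print(reconstruction_loss(a_implausible, z))
```

The adjacency matrix that agrees with the embeddings yields the smaller loss; in OAR, a smaller reconstruction loss translates into a larger OOD weight for the corresponding adversarial graph.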

Formula formula_6: $\mathrm{Recall}(G_s) = |E_s \cap E_s^{GT}| / |E_s^{GT}|$

Formula formula_7: $\tau\left(\{r_i\}_{i=1}^{n}, \{s_i\}_{i=1}^{n}\right) = \frac{2}{n(n+1)} \sum_{i<j} \mathbb{I}\left(\mathrm{sgn}(r_i - r_j) = \mathrm{sgn}(s_i - s_j)\right)$, (6)
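The pairwise-agreement count at the heart of Equation 6 can be sketched in a few lines. This is an illustrative sketch, not the authors' code, and the function name `concordant_fraction` and the example values are hypothetical. Note one explicit assumption: the sketch divides by the number of pairs, n(n-1)/2, so that perfect agreement gives 1. This convention is chosen because the scores in Table 3 are multiples of 1/15 for six explainers and reach 1.000; it is not the prefactor as printed in Equation 6.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def concordant_fraction(reference, scores):
    """Fraction of pairs (i < j) whose ordering agrees in both lists."""
    if len(reference) != len(scores) or len(reference) < 2:
        raise ValueError("need two lists of equal length >= 2")
    pairs = list(combinations(range(len(reference)), 2))
    agree = sum(
        sign(reference[i] - reference[j]) == sign(scores[i] - scores[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical Recall values and metric scores for six explainers.
recall = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9]
metric = [0.1, 0.3, 0.2, 0.6, 0.7, 0.9]  # explainers 2 and 3 are swapped

print(concordant_fraction(recall, metric))
```

With six explainers there are 15 pairs, and the single swapped pair in this toy example leaves 14 of them in agreement.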

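Formula formula_8: $\tau_{*} = \tau\left( \left\{ \frac{1}{N} \sum_{i=1}^{N} \mathrm{Recall}(G_s^{i,h}) \right\}_{h \in H}, \left\{ \frac{1}{N} \sum_{i=1}^{N} s_{*}^{i,h} \right\}_{h \in H} \right)$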

Formula formula_9: $\tau_{*} = \tau\left( \{\mathrm{Recall}(G_s^{i})\}_{i=1}^{N}, \{s_{*}^{i}\}_{i=1}^{N} \right)$

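Formula formula_10: $y^{(i)} \leftarrow f(G'^{(i)})_c$

Formula formula_11: $\mathcal{L}^{(i)}_{recon}$

Formula formula_12: $w^{(i)}_{OOD} \leftarrow \frac{1/\mathcal{L}^{(i)}_{recon}}{\sum_j 1/\mathcal{L}^{(j)}_{recon}}$, $i = 1, 2, \ldots, N_{adv}$; $s \leftarrow \sum_i w^{(i)}_{OOD} \cdot y^{(i)}$

Formula formula_13: $K_{pos} \leftarrow \lfloor L_k \times K_{GT} \rceil$; $K_{neg} \leftarrow K_{sub} - K_{pos}$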
