['1c1', '< Title: Evaluating Post-hoc Explanations for Graph Neural Networks via Robustness Analysis', '---', '> Title: OOD-Resistant Adversarial Robustness (OAR): A Novel Metric for Robust Evaluation of GNN Explanations', '3c3', "< Abstract: This work studies the evaluation of explaining graph neural networks (GNNs), which is crucial to the credibility of post-hoc explainability in practical usage. Conventional evaluation metrics, and even explanation methods -which mainly follow the paradigm of feeding the explanatory subgraph to the model and measuring output difference -mostly suffer from the notorious out-of-distribution (OOD) issue. Hence, in this work, we endeavor to confront this issue by introducing a novel evaluation metric, termed OOD-resistant Adversarial Robustness (OAR). Specifically, we draw inspiration from adversarial robustness and evaluate post-hoc explanation subgraphs by calculating their robustness under attack. On top of that, an elaborate OOD reweighting block is inserted into the pipeline to confine the evaluation process to the original data distribution. For applications involving large datasets, we further devise a Simplified version of OAR (SimOAR), which achieves a significant improvement in computational efficiency at the cost of a small amount of performance. Extensive empirical studies validate the effectiveness of our OAR and SimOAR. Code is available at https://github.com/MangoKiller/SimOAR_OAR. Recently, a compromised paradigm -Feature Removal [12, 13] -has been prevailing to quantitatively evaluate the explanation's predictive power as compared to the full graph, without exploiting the human supervision and ground truth. The basic idea is to first remove the unimportant features and feed the remaining part (i.e., explanatory subgraph) into the GNN, and then observe how the prediction changes. 
The prediction discrepancy instantiates Accuracy [8] and Fidelity [9] of the † Liu Wei is equal contribution to this paper.", '---', '> Abstract: Reliable evaluation of post-hoc explanations for Graph Neural Networks (GNNs) is critical for their trustworthy deployment, yet conventional metrics often struggle with out-of-distribution (OOD) issues. This work directly confronts this challenge by introducing OOD-resistant Adversarial Robustness (OAR), a novel evaluation metric. Inspired by adversarial robustness, OAR assesses the quality of an explanation subgraph by measuring its robustness under attack, crucially integrating an OOD reweighting block to ensure the evaluation remains within the original data distribution. For large-scale applications, we propose a Simplified OAR (SimOAR), which significantly enhances computational efficiency with minimal performance compromise. Extensive empirical studies across various explanation methods, datasets, and GNN backbones demonstrate the superior effectiveness and consistency of OAR and SimOAR compared to existing removal- and generation-based metrics. Code is available at https://github.com/MangoKiller/SimOAR_OAR. † Liu Wei is equal contribution to this paper.', '6,12c6', '< Post-hoc explainability has manifested its extraordinary power to explain graph neural networks (GNNs) [1,2,3,4]. Given a GNN-generated prediction for a graph, it aims to identify an explanatory subgraph, which is expected to best support the prediction and make the decision-making process more credible, fair, and understandable [5,6,7]. However, the reliable evaluation of explanation quality remains a key challenge. As a primary solution, Human supervision seeks to justify whether the explanations align with human knowledge [8,9], but it is often too subjective, thus hardly providing quantifiable assessments. 
Another straightforward solution is quantitatively measuring the agreement between the generated and ground-truth explanations, such as Precision and Recall [10,11]. Unfortunately, access to the ground truth is usually unavailable and labor-extensive, thereby limiting the scope of evaluations based on this method.', '< Figure 1: Pipelines and flaws of different evaluation methods. In the "Input" graph, -NH 2 is considered as the ground truth explanation for its mutagenicity. Best viewed in color. explanation, reflecting "how accurate and faithful the explanation is to recover the prediction of the input graph". Despite the prevalence, these removal-based metrics usually come with the caveat of the out-of-distribution (OOD) issue [14,13]. Specifically, as the after-removal subgraphs are likely to lie off the distribution of full graphs [15,16], the GNN is forced to handle these off-manifold inputs and easily gets erroneous predictions [17,18]. Take Figure 1 (a) as an example. For the full molecular graph, the GNN classifies it as "mutagenic", which is reasonable due to the presence of mutagenic -NH 2 group; whereas, when taking the subgraph, i.e., non-mutagenic C-Cl group solely as the input, the GNN surprisingly maintains its output "mutagenic". Clearly, the prediction on the explanatory subgraph might be skeptical, which easily deteriorates the faithfulness of the removal-based evaluations.', '< In sight of this, recent efforts [18,17] are beginning to mitigate the OOD issue via the Generationbased metrics. Instead of directly feeding the to-be-evaluated subgraph into the target GNN, they use a generative model [19,20] to imagine and generate a new full graph conditioned on the subgraph. These methods believe that the generation process could infill the subgraph and pull it closer to the original graph distribution. As Figure 1 (b) shows, comparing the predictions on this new graph and the original graph could be the surrogate evaluation. 
While intuitively appealing, the generative models easily inherit the data bias and inject it into the infilling part. Considering Figure 1 (b) again, since molecules in the Mutagenicity dataset comprising non-mutagenic chloride (-Cl) always carry amino (-NH 2 ), a generative model is prone to capture this occurrence bias and tend to infill -NH 2 with the -Cl-involved subgraphs. This bias not only exerts on the generations but also makes the evaluation inconsistent with the GNN behavior: in the generative model, -Cl is assigned with more "mutagenic" scores as it usually accompanies the mutagenic partner -NH 2 ; in contrast, the GNN finds no "mutagenic" clues in -Cl. In a nutshell, the generation-based metrics show respect to the data distribution somehow but could be inconsistent with GNNs\' behavior and lose control of the infilling part.', '< Scrutinizing these removal-and generation-based metrics (as summarized in Figure 1), we naturally raise a question: "Can we design a metric that respects the data distribution and GNN behavior simultaneously?" To this end, we draw inspiration from adversarial robustness [21,22] and propose a new evaluation framework, OAR (OOD-resistant Adversarial Robustness), to help reliably assess the explanations. As shown in Figure 1 (c), OAR encapsulates two components: constrained attack and OOD reweighting, which respect the GNN behavior and data distribution, respectively. Specifically:', '< • Intuitively, perturbations on label-irrelevant features should be ineffective to the GNN prediction, while those on label-relevant features are supposed to be impactful and destructive to the prediction [22,21]. Hence, for the original input graph, the attack model performs perturbations constrained on the complementary part of its explanation. This perturbing game aims to naturally take control of the "infilling" process, making the explanatory subgraph less influenced by its infilling. 
• Having obtained a set of perturbed graphs, the reweighting component estimates the OOD score of each perturbed graph, which reflects the degree of distribution shift from the original data distribution. Then, we feed these graphs into GNN and reweight their predictions with OOD scores. The sum of weighted predictions can quantitatively evaluate the importance of the target subgraph.', '< We validate the effectiveness of OAR in evaluation tasks across various start-of-the-art explanation methods, datasets, and backbone GNN models. OAR has manifested surprising consistency with the metrics like Precision, Recall and Human Supervision. Furthermore, to better generalize to the large datasets, we also provide a Simplified version of OAR (SimOAR) achieving significant improvements in computational efficiency at the expense of a small amount of performance degradation. Our main contributions can be summarized as follows:', "< • We propose a novel metric, OAR for evaluating GNNs' explainability, which tries to resolve the limitation of current removal-and generative-based evaluations by taking both data distribution and GNN behavior into account (Section 2.2). • We provide a simplified version of OAR, SimOAR for better generalization to the evaluation tasks involving large datasets, which greatly shortens the execution time while only sacrificing a small amount of performance (Section 2.3). • Experimental results demonstrate that our OAR/SimOAR outperforms the current evaluation metrics by a large margin, and further validate the high efficiency of SimOAR (Section 3).", '---', '> Post-hoc explainability has emerged as a crucial area for interpreting Graph Neural Networks (GNNs) [1,2,3,4], aiming to identify salient subgraphs that justify model predictions and enhance trust, fairness, and understanding [5,6,7]. However, the reliable evaluation of these explanations remains a significant challenge. 
Traditional approaches, such as human supervision [8,9] and agreement with ground-truth explanations [10,11], suffer from subjectivity, labor-intensiveness, and limited availability of ground truth, restricting their practical applicability.', '13a8,21', '> Figure 1: Pipelines and flaws of different evaluation methods. In the "Input" graph, -NH2 is considered the ground-truth explanation for its mutagenicity. Best viewed in color.', '> A prevailing quantitative evaluation paradigm is Feature Removal [12,13], which assesses an explanation\'s predictive power by removing "unimportant" features and observing the GNN\'s output on the remaining subgraph. Metrics like Accuracy [8] and Fidelity [9] stem from this idea. Despite their widespread use, removal-based metrics are hampered by the out-of-distribution (OOD) issue [14,13]. When features are removed, the resulting subgraphs are likely to lie off the distribution of full graphs [15,16], forcing the GNN to process off-manifold inputs and potentially yield erroneous predictions [17,18]. For instance, as illustrated in Figure 1 (a), a GNN classifies a full molecular graph as "mutagenic" due to the presence of a -NH2 group. Yet, when presented with only a non-mutagenic C-Cl subgraph, it still predicts "mutagenic," undermining the faithfulness of the explanation evaluation.', '> ', '> In response to the OOD issue, Generation-based metrics [18,17] have been proposed. These methods employ generative models [19,20] to "infill" the subgraph, conditioning on it to generate a new full graph that is expected to lie closer to the original data distribution. As shown in Figure 1 (b), the evaluation then compares predictions on this new graph with those on the original. While conceptually appealing, generative models easily inherit data biases and inject them into the infilled part. 
For example, if molecules in the Mutagenicity dataset that contain non-mutagenic chloride (-Cl) frequently also carry amino (-NH2) groups, a generative model is prone to capture this co-occurrence and infill -NH2 when given a -Cl-involved subgraph. This bias not only distorts the generated graphs but also leads to inconsistencies with the GNN\'s behavior: the generative model might assign high "mutagenic" scores to -Cl due to its co-occurrence with -NH2, whereas the GNN itself finds no mutagenic cues in -Cl. Thus, generation-based metrics, while addressing OOD to some extent, can be inconsistent with GNN behavior and lack control over the infilled part.', '> ', '> These limitations of removal- and generation-based metrics (summarized in Figure 1) lead us to a critical question: "Can we devise an evaluation metric that simultaneously respects both the data distribution and the GNN\'s behavior?" To address this, we introduce OAR (OOD-resistant Adversarial Robustness), an evaluation framework inspired by adversarial robustness [21,22]. As depicted in Figure 1 (c), OAR comprises two key components: constrained attack and OOD reweighting, which are designed to account for GNN behavior and data distribution, respectively. Specifically:', '> •   **Constrained Attack:** Drawing from the principle that perturbations on label-irrelevant features should minimally affect GNN predictions, while those on label-relevant features should be impactful [22,21], our attack model applies perturbations *only* to the complement of the explanatory subgraph in the original input graph. This mechanism takes control of the "infilling" process, so that the explanatory subgraph is less influenced by its infilling.', '> •   **OOD Reweighting:** After generating a set of perturbed graphs, this component estimates an "OOD score" for each, quantifying its degree of shift from the original data distribution. 
These OOD scores are then used to reweight the GNN\'s predictions on the perturbed graphs. By summing these weighted predictions, OAR quantitatively assesses the importance of the target subgraph while down-weighting OOD instances.', '> ', "> We conduct extensive empirical studies, validating OAR's effectiveness across various state-of-the-art explanation methods, datasets, and GNN backbones. OAR shows strong consistency with metrics like Precision, Recall, and human supervision. Furthermore, to better scale to large datasets, we introduce a Simplified version of OAR (SimOAR), which achieves a significant improvement in computational efficiency at the cost of a small amount of performance. Our main contributions are summarized as follows:", '> •   We propose OAR, a novel evaluation metric for GNN explainability, which aims to resolve the limitations of current removal- and generation-based approaches by explicitly considering both data distribution and GNN behavior (Section 2.2).', '> •   We introduce SimOAR, a simplified variant of OAR designed for large-scale evaluation tasks. SimOAR greatly shortens execution time while sacrificing only a small amount of performance (Section 2.3).', "> •   Experimental results demonstrate that OAR and SimOAR outperform current evaluation metrics by a large margin, and further validate SimOAR's computational efficiency (Section 3).", '> ', '313d320', '< ']
