Can We Trust Explanations? Evaluation of Model-Agnostic Explanation Techniques

Published: 15 Oct 2025, Last Modified: 31 Oct 2025, BNAIC/BeNeLearn 2025 Poster, CC BY 4.0
Track: Type B (Encore Abstracts)
Keywords: Explainable AI, Black-Box, LIME, SHAP, Breast Cancer
Abstract: Explainable AI (XAI) assists clinicians and researchers in understanding the rationale behind the predictions made by data-driven models, helping them make informed decisions and trust the model's outputs. However, given the variety of explanation techniques, there is no universally applicable evaluation metric that can reliably assess the quality of all explanations. This study addresses this gap by introducing a set of universal evaluation metrics designed to assess explanation performance across different techniques and contexts. We conduct a comprehensive comparison of two widely used post-hoc explanation methods, Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), applied to a highly imbalanced multiclass-multioutput breast cancer treatment prediction task. These methods were evaluated using the proposed evaluation metrics, which include fidelity, stability, consistency, and alignment with clinical guidelines. Our findings reveal that SHAP generally provides more faithful and consistent explanations than LIME, especially in alignment with clinical knowledge. These results reinforce the need for tailored evaluation strategies rather than reliance on a single universal metric, highlighting that the choice of explanation method should be informed by the specific clinical context and interpretability goals.

**Introduction** In the healthcare industry, XAI is frequently used for clinical diagnosis [1], drug delivery [2], disease classification, and treatment recommendations [3, 4]. Some studies [7, 8] highlight the strengths and weaknesses of two widely used post-hoc explanation methods, LIME [5] and SHAP [6]. However, to the best of our knowledge, direct comparisons of LIME and SHAP using the same evaluation metrics, especially in healthcare domains, remain limited. The primary objective of this research is to propose a set of universally applicable evaluation metrics and to conduct an in-depth comparison of LIME and SHAP for predicting breast cancer treatments. For a fair comparison of the explanation techniques, we introduce both application-level and human-level evaluations. Application-level evaluation assesses fidelity, stability, and alignment with medical guidelines, while human-level evaluation involves a qualitative analysis of the generated explanations, focusing on their interpretability and usefulness from an expert's perspective.

**Fidelity** measures the similarity between the predictions of the black-box model and those of the surrogate model; it can be expressed as $R^2 = 1 - \frac{\sum_{i=1}^k (f(z^{(i)}) - g(z^{(i)}))^2}{\sum_{i=1}^k (f(z^{(i)}) - \bar{f})^2}$, where $f(z^{(i)})$ are the complex model's predictions for the perturbed samples, $g(z^{(i)})$ are the surrogate model's predictions for the same samples, and $\bar{f}$ is the mean of the original model's predictions.

**Stability** compares the variable composition of explanations generated multiple times for the same instance: $\text{Stability} = \frac{\sum_{(E^i, E^j) \in C_m^2} \frac{C_{pair}(E^i, E^j)}{p}}{\left| C_m^2(E^1, \dots, E^m) \right|}$, where the concordance function $C_{pair}$ returns the cardinality of the intersection of two explanations, $p$ is the number of variables per explanation, and $C_m^2(E^1, \dots, E^m)$ is the set of all pairs among the $m$ explanations.
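As a companion to the fidelity and stability definitions above, the following is a minimal Python sketch of how both scores could be computed. The function names (`fidelity_r2`, `stability`), the example feature names, and the three example explanation runs are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np
from itertools import combinations

def fidelity_r2(f_preds, g_preds):
    # R^2-style fidelity between black-box predictions f(z^(i)) and
    # surrogate predictions g(z^(i)) on the same perturbed samples.
    f_preds, g_preds = np.asarray(f_preds, float), np.asarray(g_preds, float)
    ss_res = np.sum((f_preds - g_preds) ** 2)
    ss_tot = np.sum((f_preds - f_preds.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def stability(explanations, p):
    # Mean pairwise concordance of m explanations generated for the same
    # instance; each explanation is the set of its top-p feature names.
    pairs = list(combinations(explanations, 2))
    return sum(len(set(a) & set(b)) / p for a, b in pairs) / len(pairs)

# Hypothetical usage: three explanation runs on one instance, top-3 features each.
runs = [{"tumor_size", "grade", "er_status"},
        {"tumor_size", "grade", "age"},
        {"tumor_size", "er_status", "grade"}]
print(stability(runs, p=3))                                # ~0.78
print(fidelity_r2([0.9, 0.2, 0.6], [0.85, 0.25, 0.55]))    # close to 1
```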
**Comparison with guidelines:** Let $G$ be the set of medical guidelines, where $G_i \in \{\text{High}, \text{Medium}, \text{Low}\}$ represents the importance of feature $i$. The comparison index $\Gamma(E, G)$ measures the concordance between the explanations $E$ and the medical guidelines $G$ and is defined as $\Gamma(E, G) = \frac{1}{|G|} \sum_{i=1}^{|G|} \mathcal{I}_i$, where $\mathcal{I}_i$ indicates whether the importance assigned to feature $i$ in $E$ agrees with the guideline level $G_i$ (a small computational sketch is given after the abstract). A value of $\Gamma(E, G)$ close to 1 indicates that the explanations closely match the guidelines, and vice versa.

**Human-Level Evaluation:** We conducted a survey with a clinician, who evaluated the explanations based on seven key criteria: Understandability, Satisfaction, Level of Detail, Completeness, Trustworthiness, Predictability, and Safety/Reliability. Each criterion was rated on an integer scale from 1 (very poor) to 10 (excellent), providing a quantitative measure of the perceived quality of the explanations.

**Conclusion** In this paper, we introduced a set of metrics to assess the quality of any model-agnostic explanation technique, regardless of its underlying working principle. We focused on two widely used explanation methods, LIME and SHAP, which operate on fundamentally different principles, and conducted a comprehensive analysis of their performance in predicting breast cancer treatments using a highly imbalanced synthetic IKNL dataset. Our experiments showed that SHAP outperformed LIME in terms of fidelity and stability, and aligned more consistently with the medical guidelines and with the expert evaluation. For additional information, please see the full article at the following link: https://www.scitepress.org/Papers/2025/131574/131574.pdf
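Below is a minimal, hedged Python sketch of the guideline-concordance index $\Gamma(E, G)$ described in the abstract; the feature names, guideline levels, and the helper `guideline_concordance` are hypothetical illustrations rather than the paper's actual implementation.

```python
def guideline_concordance(explanation_levels, guidelines):
    # Gamma(E, G): fraction of guideline features whose importance level
    # derived from the explanation agrees with the clinical guideline.
    # Both arguments map feature name -> "High" / "Medium" / "Low".
    indicators = [1 if explanation_levels.get(f) == level else 0
                  for f, level in guidelines.items()]
    return sum(indicators) / len(guidelines)

# Hypothetical guideline and explanation levels, for illustration only.
guidelines = {"tumor_size": "High", "grade": "High",
              "age": "Medium", "menopausal_status": "Low"}
explanation = {"tumor_size": "High", "grade": "Medium",
               "age": "Medium", "menopausal_status": "Low"}
print(guideline_concordance(explanation, guidelines))  # 0.75: three of four features agree
```

A score of 1 would mean every guideline feature is assigned the importance level the guideline expects, matching the interpretation of $\Gamma(E, G)$ given above.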
Serve As Reviewer: ~Annette_Ten_Teije2
Submission Number: 46