On the Instability of Local Posthoc Explanations

ACL ARR 2024 August Submission 421 Authors

16 Aug 2024 (modified: 19 Sept 2024)
License: CC BY 4.0
Abstract: Explanations of model decisions are important for building trust in machine learning systems, especially in high-stakes areas like healthcare. However, existing post-hoc explanation methods often suffer from instability, producing inconsistent results for similar inputs and thereby undermining their reliability. In this paper, we conduct a systematic investigation into the factors contributing to this instability across different model architectures and explanation methods. Our analysis reveals that model type, rather than hyperparameters, is the primary driver of stability, with transformer models exhibiting greater instability than architectures such as LSTMs, regardless of model size. We also explore the role of sparsity in transformer models, finding that while sparse pretrained transformers improve the stability of gradient-based explanations, similar benefits are not observed with perturbation-based methods. Furthermore, our findings suggest that a portion of the disagreement between different explanation methods can be traced back to this instability, highlighting the importance of stable model explanations for developing more reliable and interpretable AI systems.
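To make the notion of instability concrete, the minimal sketch below shows one way such an effect could be measured: compute a gradient-based attribution for an input and for a slightly perturbed "similar" input, then score their agreement. The TinyClassifier model, the gradient x input saliency, the single-token perturbation, and the cosine-similarity stability score are all illustrative assumptions, not the paper's actual models, datasets, or metrics.

```python
# Hypothetical sketch: measuring the stability of a gradient-based
# explanation under a small input perturbation. All components here are
# illustrative stand-ins, not the submission's experimental setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyClassifier(nn.Module):
    """Minimal embedding + mean-pool classifier used only for illustration."""
    def __init__(self, vocab_size=100, dim=16, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids):
        emb = self.embed(token_ids)            # (seq_len, dim)
        emb.retain_grad()                      # keep gradients for saliency
        logits = self.head(emb.mean(dim=0))    # (num_classes,)
        return logits, emb

def saliency(model, token_ids, target_class):
    """Gradient x input attribution per token (one common saliency variant)."""
    logits, emb = model(token_ids)
    logits[target_class].backward()
    return (emb.grad * emb).sum(dim=-1).detach()   # (seq_len,)

model = TinyClassifier()

original = torch.tensor([5, 12, 47, 3, 88])
perturbed = original.clone()
perturbed[2] = 48                              # a single-token "similar input"

attr_a = saliency(model, original, target_class=1)
model.zero_grad()
attr_b = saliency(model, perturbed, target_class=1)

# One simple stability score: cosine similarity of the two attribution vectors.
stability = torch.nn.functional.cosine_similarity(attr_a, attr_b, dim=0)
print(f"attribution stability (cosine): {stability.item():.3f}")
```

A stability score near 1 would indicate that the explanation is largely unchanged by the perturbation, while lower values would indicate the kind of inconsistency the abstract describes; perturbation-based explainers could be compared under the same protocol by swapping out the attribution function.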
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty, feature attribution, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 421