Rethinking Stability for Attribution-based Explanations

Chirag Agarwal; Nari Johnson; Martin Pawelczyk; Satyapriya Krishna; Eshika Saxena; Marinka Zitnik; Himabindu Lakkaraju

Published: 25 Mar 2022, Last Modified: 14 Jan 2026 | ICLR 2022 PAIR^2Struct Oral | Readers: Everyone

Keywords: stability, explainability

TL;DR: Stability metrics should measure the change in the output explanation with respect not only to changes in the input, but also to changes in the behaviour of the underlying model.

Abstract: As attribution-based explanation methods are increasingly used to establish model trustworthiness in high-stakes situations, it is critical to ensure that these explanations are stable, e.g., robust to infinitesimal perturbations to an input. However, previous works have shown that state-of-the-art explanation methods generate unstable explanations. Here, we introduce metrics to quantify the stability of an explanation and show that several popular explanation methods are unstable. In particular, we propose new Relative Stability metrics that measure the change in output explanation with respect to change in input, model representation, or output of the underlying predictor. Finally, our experimental evaluation with three real-world datasets demonstrates interesting insights for seven explanation methods and different stability metrics.
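To make the idea of a relative stability metric concrete, here is a minimal Python sketch. It assumes one plausible reading of the abstract: the relative change in the explanation divided by the relative change in a reference quantity, which may be the input, an internal model representation, or the model output. The function name `relative_stability`, its parameters, and the `eps` safeguard are illustrative assumptions. They are not taken from the paper, whose exact definitions may differ (for example, in how the denominator is normalized or in taking a maximum over many perturbations).

```python
import numpy as np

def relative_stability(ref, ref_pert, expl, expl_pert, p=2, eps=1e-8):
    """Hypothetical sketch: relative explanation change / relative reference change.

    ref, ref_pert   : reference quantity before and after perturbation
                      (input, model representation, or model output)
    expl, expl_pert : attribution vectors for the original and perturbed input
    eps             : assumed safeguard against division by zero
    """
    ref, ref_pert = np.asarray(ref, float), np.asarray(ref_pert, float)
    expl, expl_pert = np.asarray(expl, float), np.asarray(expl_pert, float)
    # Element-wise relative change in the explanation, summarized by an l_p norm.
    num = np.linalg.norm((expl - expl_pert) / (np.abs(expl) + eps), ord=p)
    # Element-wise relative change in the reference quantity.
    den = np.linalg.norm((ref - ref_pert) / (np.abs(ref) + eps), ord=p)
    # Larger values mean the explanation moved more than the reference did.
    return num / max(den, eps)

# Toy example: a 10% change in one input feature and a 10% change in one attribution.
x, x_pert = [1.0, 2.0], [1.1, 2.0]
e, e_pert = [0.5, 0.5], [0.55, 0.5]
score = relative_stability(x, x_pert, e, e_pert)
```

In this toy case the explanation changes in proportion to the input, so the score is close to 1. Passing hidden-layer activations or predicted logits as `ref` and `ref_pert` would give the representation-based and output-based variants under the same assumptions.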

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/rethinking-stability-for-attribution-based/code)
