TL;DR: We propose a new objective criterion for evaluating explanations based on the notion of adversarial robustness. The criterion further allows us to derive new explanations that capture pertinent features both qualitatively and quantitatively.
Abstract: Among the many ways of interpreting a machine learning model, measuring the importance of a set of features tied to a prediction is probably one of the most intuitive ways to explain the model. In this paper, we establish the link between a set of features and a prediction through a new evaluation criterion, robustness analysis, which measures the minimum tolerance of adversarial perturbation. By measuring the tolerance level under an untargeted adversarial attack, we can extract the set of features that provides the most robust support for the current prediction; by instead setting up a targeted adversarial attack, we can extract the set of features that contrasts the current prediction with a target class. Applying this methodology to various prediction tasks across multiple domains, we observe that the derived explanations indeed capture significant feature sets, both qualitatively and quantitatively.
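The core quantity is easy to sketch. Below is a minimal, hypothetical Python illustration (not the authors' exact algorithm): given a model, a flat input `x`, and a candidate explanation `relevant` (a set of feature indices to anchor), it estimates the smallest L2 perturbation radius on the *remaining* features that flips the prediction. A larger radius means the anchored set gives more robust support. The function name `min_flip_radius`, the generic `model` callable, and the random-sampling attack (a crude stand-in for a proper adversarial attack) are all illustrative assumptions.

```python
import numpy as np

def min_flip_radius(model, x, relevant, radii, n_samples=256, seed=0):
    """Estimate the minimum L2 radius of a perturbation, restricted to the
    features outside `relevant`, that changes the predicted class.
    `model` maps a feature vector to a vector of class scores."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mask = np.ones_like(x)            # 1 = feature is allowed to move
    mask[list(relevant)] = 0.0        # anchor the explained features
    y0 = int(np.argmax(model(x)))     # current prediction
    for r in sorted(radii):           # sweep radii from small to large
        for _ in range(n_samples):
            delta = rng.normal(size=x.shape) * mask
            norm = np.linalg.norm(delta)
            if norm > 0:
                delta *= r / norm     # project onto the L2 sphere of radius r
            if int(np.argmax(model(x + delta))) != y0:
                return r              # prediction flipped at this radius
    return np.inf                     # robust within all tested radii
```

Under this criterion, one candidate feature set dominates another if anchoring it yields a larger estimated radius; the targeted variant described in the abstract would replace the flip check with a test against a specific target class.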
Keywords: Interpretability, Explanations, Adversarial Robustness
Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/arxiv:2006.00442/code)