Abstract: Interpretability has largely focused on local explanations, i.e., explaining why a model made a particular prediction for a given sample. These explanations are appealing due to their simplicity and local fidelity. However, they do not provide information about the general behavior of the model. We propose to leverage model distillation to learn global additive explanations that describe the relationship between input features and model predictions. These global explanations take the form of feature shapes, which are more expressive than feature attributions. Through careful experimentation, we show qualitatively and quantitatively that global additive explanations are able to describe model behavior and yield insights into models such as neural nets. A visualization of our approach applied to a neural net as it is trained is available at https://youtu.be/ErQYwNqzEdc.
Keywords: global interpretability, additive explanations, model distillation, neural nets, tabular data
TL;DR: We propose to leverage model distillation to learn global additive explanations in the form of feature shapes (which are more expressive than feature attributions) for models such as neural nets trained on tabular data.
Code: [shftan/distilled_additive_explanations](https://github.com/shftan/distilled_additive_explanations)
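
The sketch below illustrates the distillation recipe described in the abstract: train a black-box "teacher" neural net on tabular data, then fit an additive "student" model to the teacher's predictions, and read off the resulting per-feature shapes. This is not the authors' implementation (see the linked repository); the synthetic dataset, the spline-plus-ridge student parameterization, and all hyperparameters are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import SplineTransformer, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Synthetic tabular regression data (placeholder for a real dataset).
X, y = make_friedman1(n_samples=2000, n_features=5, random_state=0)

# 1. Teacher: a neural net trained on the original labels.
teacher = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
)
teacher.fit(X, y)

# 2. Distillation targets: the teacher's predictions, not the true labels.
y_teacher = teacher.predict(X)

# 3. Student: an additive model (per-feature spline bases + linear layer)
#    fit to the teacher's predictions.
student = make_pipeline(
    SplineTransformer(n_knots=10, degree=3),
    Ridge(alpha=1.0),
)
student.fit(X, y_teacher)

# 4. Feature shape for one feature: because the student is additive, varying
#    one feature while holding the rest fixed traces out that feature's
#    shape (up to a constant offset).
j = 0
grid = np.linspace(X[:, j].min(), X[:, j].max(), 200)
X_ref = np.tile(X.mean(axis=0), (len(grid), 1))
X_ref[:, j] = grid
shape = student.predict(X_ref)
shape -= shape.mean()  # center the shape, as is conventional for GAMs

plt.plot(grid, shape)
plt.xlabel(f"feature {j} value")
plt.ylabel("contribution to distilled prediction")
plt.title("Distilled feature shape (illustrative)")
plt.show()
```

Because the student is additive across features, each feature's shape can be plotted on its own axis and compared against the teacher's behavior; the fidelity of the student to the teacher can be checked by scoring the student against `y_teacher` on held-out data.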