Regularizing Black-box Models for Improved Interpretability

25 Sept 2019 (modified: 22 Oct 2023) · ICLR 2020 Conference Blind Submission · Readers: Everyone
Keywords: Interpretable Machine Learning, Local Explanations, Regularization
TL;DR: If you train your model with our regularizers, black-box explanation systems will work better on the resulting model. Further, it's likely that the resulting model will be more accurate as well.
Abstract: Most of the work on interpretable machine learning has focused on designing either inherently interpretable models, which typically trade off accuracy for interpretability, or post-hoc explanation systems, which lack guarantees about their explanation quality. We explore an alternative to these approaches by directly regularizing a black-box model for interpretability at training time. Our approach explicitly connects three key aspects of interpretable machine learning: (i) the model's internal interpretability, (ii) the explanation system used at test time, and (iii) the metrics that measure explanation quality. Our regularization results in substantial improvement in terms of the explanation fidelity and stability metrics across a range of datasets and black-box explanation systems while slightly improving accuracy. Finally, we justify theoretically that the benefits of our regularization generalize to unseen points.
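To make the idea concrete, below is a minimal sketch (not the authors' code; see the linked repository for their actual method) of one way to regularize a model toward local explainability at training time: penalize how far the model's outputs on perturbed neighbors deviate from a first-order (linear) approximation around each training point, which is the kind of local behavior that surrogate-based explanation systems reward with high fidelity. All names, hyperparameters, and the Gaussian perturbation scheme here are illustrative assumptions.

```python
# Hypothetical sketch of a local-fidelity regularizer, assuming PyTorch.
import torch
import torch.nn as nn

def local_linearity_penalty(model, x, sigma=0.1, n_samples=8):
    """Penalize curvature of `model` around each point in `x`:
    sample Gaussian perturbations and compare the model's outputs
    to a first-order Taylor approximation built from the gradient at x.
    A locally linear model is easier for explanation systems to fit."""
    x = x.detach().requires_grad_(True)
    y = model(x)  # shape (batch, 1) assumed for brevity
    grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    penalty = 0.0
    for _ in range(n_samples):
        delta = sigma * torch.randn_like(x)
        y_pert = model(x + delta)
        y_lin = y + (delta * grad).sum(dim=1, keepdim=True)
        penalty = penalty + ((y_pert - y_lin) ** 2).mean()
    return penalty / n_samples

# Usage: total loss = task loss + lambda * interpretability penalty.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x_batch, y_batch = torch.randn(16, 10), torch.randn(16, 1)
task_loss = nn.MSELoss()(model(x_batch), y_batch)
loss = task_loss + 0.1 * local_linearity_penalty(model, x_batch)
loss.backward()
```

The key design point this illustrates is the one the abstract emphasizes: the regularizer is defined in terms of the explanation system's quality metric (here, fidelity of a local linear surrogate), so training directly optimizes the property that is measured at test time.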
Code: https://github.com/ForReview11235/CodeForICLR2020
Community Implementations: 1 code implementation (https://www.catalyzex.com/paper/arxiv:1902.06787/code)