Regularizing Black-box Models for Improved InterpretabilityDownload PDF

25 Sep 2019 (modified: 24 Dec 2019)ICLR 2020 Conference Blind SubmissionReaders: Everyone
  • Original Pdf: pdf
  • Keywords: Interpretable Machine Learning, Local Explanations, Regularization
  • TL;DR: If you train your model with our regularizers, black-box explanations systems will work better on the resulting model. Further, its likely that the resulting model will be more accurate as well.
  • Abstract: Most of the work on interpretable machine learning has focused on designingeither inherently interpretable models, which typically trade-off accuracyfor interpretability, or post-hoc explanation systems, which lack guarantees about their explanation quality. We explore an alternative to theseapproaches by directly regularizing a black-box model for interpretabilityat training time. Our approach explicitly connects three key aspects ofinterpretable machine learning: (i) the model’s internal interpretability, (ii)the explanation system used at test time, and (iii) the metrics that measureexplanation quality. Our regularization results in substantial improvementin terms of the explanation fidelity and stability metrics across a range ofdatasets and black-box explanation systems while slightly improving accuracy. Finally, we justify theoretically that the benefits of our regularizationgeneralize to unseen points.
  • Code:
10 Replies