Regularizing Black-box Models for Improved Interpretability

Gregory Plumb; Maruan Al-Shedivat; Eric Xing; Ameet Talwalkar

25 Sept 2019 (modified: 22 Jun 2025) | ICLR 2020 Conference Blind Submission | Readers: Everyone

Keywords: Interpretable Machine Learning, Local Explanations, Regularization

TL;DR: If you train your model with our regularizers, black-box explanation systems will work better on the resulting model. Further, it is likely that the resulting model will be more accurate as well.

Abstract: Most of the work on interpretable machine learning has focused on designing either inherently interpretable models, which typically trade off accuracy for interpretability, or post-hoc explanation systems, which lack guarantees about their explanation quality. We explore an alternative to these approaches by directly regularizing a black-box model for interpretability at training time. Our approach explicitly connects three key aspects of interpretable machine learning: (i) the model's internal interpretability, (ii) the explanation system used at test time, and (iii) the metrics that measure explanation quality. Our regularization results in substantial improvement in terms of the explanation fidelity and stability metrics across a range of datasets and black-box explanation systems while slightly improving accuracy. Finally, we justify theoretically that the benefits of our regularization generalize to unseen points.
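The abstract names explanation fidelity as a metric but does not define it here. As a rough illustration only, the sketch below assumes one common reading: fit a linear surrogate to the black-box model on points sampled around an input and report the surrogate's mean squared error on those points. The function name `local_linear_fidelity`, the Gaussian neighborhood, and the parameters `sigma` and `n_samples` are assumptions made for this example and are not taken from the paper or its code.

```python
import numpy as np

def local_linear_fidelity(f, x, sigma=0.1, n_samples=200, seed=0):
    """Hypothetical fidelity score: error of a local linear surrogate of f around x.

    f is any callable mapping a 1-D array to a scalar (the black-box model).
    Lower values mean a linear explanation matches f more closely near x.
    """
    rng = np.random.default_rng(seed)
    # Assumed neighborhood: isotropic Gaussian perturbations of x.
    X = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    y = np.array([f(xi) for xi in X])
    # Least-squares linear fit with an intercept column.
    A = np.hstack([X, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    # Mean squared error of the surrogate on the sampled neighborhood.
    residual = A @ coef - y
    return float(np.mean(residual ** 2))

x0 = np.zeros(2)
linear_score = local_linear_fidelity(lambda z: 2.0 * z[0] - z[1] + 0.5, x0)
wiggly_score = local_linear_fidelity(lambda z: np.sin(10.0 * z[0]), x0)
print(linear_score, wiggly_score)
```

A model that is locally close to linear scores near zero, while a rapidly varying one scores higher. Under this reading, a training-time regularizer would penalize a differentiable version of such a score alongside the usual loss; how the paper actually does this is not stated on this page.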

Code: https://github.com/ForReview11235/CodeForICLR2020

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/regularizing-black-box-models-for-improved/code)
