Removing input features via a generative model to explain their attributions to classifier's decisions
Abstract: Interpretability methods often measure the contribution of an input feature to an image classifier's decisions by heuristically removing it via, e.g., blurring, adding noise, or graying out, which often produces unrealistic, out-of-distribution samples. Instead, we propose to integrate a generative inpainter into three representative attribution map methods as the mechanism for removing input features. Compared to their original counterparts, our methods (1) generate more plausible counterfactual samples under the true data generating process; (2) are more robust to hyperparameter settings; and (3) localize objects more accurately. Our findings were consistent across both the ImageNet and Places365 datasets and two different pairs of classifiers and inpainters.
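To make the idea concrete, below is a minimal sketch (not the authors' released code) of an occlusion-style attribution loop in PyTorch, where each removed patch is filled by a generative inpainter instead of being grayed out. The `inpaint_fn` interface, the `mean_fill_inpaint` stand-in, and all parameter names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def occlusion_attribution(classifier, image, target_class, inpaint_fn,
                          patch=32, stride=16):
    """Attribution map: drop in classifier confidence when each patch is
    removed by filling it in with the inpainter rather than graying it out.

    classifier: callable mapping a (1, C, H, W) tensor to class logits.
    image:      (1, C, H, W) tensor.
    inpaint_fn: callable(image, mask) -> image with the mask==1 region filled.
    """
    _, _, h, w = image.shape
    with torch.no_grad():
        base = F.softmax(classifier(image), dim=1)[0, target_class].item()

    heatmap = torch.zeros(h, w)
    counts = torch.zeros(h, w)
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            # Mask out one patch and let the inpainter produce a plausible,
            # in-distribution counterfactual instead of a gray square.
            mask = torch.zeros(1, 1, h, w)
            mask[..., top:top + patch, left:left + patch] = 1.0
            counterfactual = inpaint_fn(image, mask)
            with torch.no_grad():
                p = F.softmax(classifier(counterfactual),
                              dim=1)[0, target_class].item()
            heatmap[top:top + patch, left:left + patch] += base - p
            counts[top:top + patch, left:left + patch] += 1
    return heatmap / counts.clamp(min=1)


# Trivial stand-in inpainter (mean-color fill) used only so the sketch runs
# end to end; the paper's point is to replace this with a learned generative
# inpainting model.
def mean_fill_inpaint(image, mask):
    mean = image.mean(dim=(2, 3), keepdim=True)
    return torch.where(mask.bool(), mean.expand_as(image), image)
```

In this sketch, swapping `mean_fill_inpaint` for a generative inpainter changes only the feature-removal step, which is the modification the abstract describes applying to the three attribution methods.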
Keywords: attribution maps, generative models, inpainting, counterfactual, explanations, interpretability, explainability
Community Implementations: 1 code implementation (https://www.catalyzex.com/paper/arxiv:1910.04256/code)