Learning Explainable Models Using Attribution Priors

Anonymous

Sep 25, 2019 ICLR 2020 Conference Blind Submission readers: everyone
  • Keywords: Deep Learning, Interpretability, Attributions, Explanations, Biology, Health, Computational Biology
  • TL;DR: A method for encouraging axiomatic feature attributions of a deep model to match human intuition.
  • Abstract: Two important topics in deep learning both involve incorporating humans into the modeling process: model priors transfer information from humans to a model by regularizing the model's parameters, while model attributions transfer information from a model to humans by explaining the model's behavior. Previous work has taken important steps to connect these topics through various forms of gradient regularization. We find, however, that existing methods that use attributions to align a model's behavior with human intuition are ineffective. We develop an efficient and theoretically grounded feature attribution method, expected gradients, and a novel framework, attribution priors, to enforce prior expectations about a model's behavior during training. We demonstrate that attribution priors are broadly applicable by instantiating them on three different types of data: image data, gene expression data, and health care data. Our experiments show that models trained with attribution priors are more intuitive and generalize better than both equivalent baselines and existing methods for regularizing model behavior.
  • Code: https://www.dropbox.com/sh/xvt3vqv8xjb5nwh/AACgt-0OxiefImjVXX5UJSuua?dl=0
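
To make the two ideas in the abstract concrete, here is a minimal sketch, not the submission's implementation. It assumes the usual reading of expected gradients: a feature's attribution is the average, over reference inputs x' drawn from background data and interpolation coefficients α in [0, 1], of (x_i - x'_i) times the gradient of the model at x' + α(x - x'). The toy quadratic model, the function names `expected_gradients` and `attribution_prior_penalty`, and the mask-based prior are all hypothetical choices made for illustration. For reproducibility the sketch averages over every reference and a fixed grid of α values; a stochastic estimate would sample both at random.

```python
import numpy as np

def f(x, w):
    # Toy differentiable model (an assumption): f(x) = sum_i w_i * x_i^2
    return np.sum(w * x ** 2, axis=-1)

def grad_f(x, w):
    # Analytic gradient of the toy model with respect to x
    return 2.0 * w * x

def expected_gradients(x, background, w, n_alpha=20):
    """Deterministic estimate of expected-gradients attributions for one input x."""
    # Midpoints of n_alpha equal sub-intervals of [0, 1]
    alphas = (np.arange(n_alpha) + 0.5) / n_alpha          # (n_alpha,)
    diff = x[None, :] - background                          # (n_ref, d)
    # Points on the straight path from each reference to x: (n_ref, n_alpha, d)
    points = background[:, None, :] + alphas[None, :, None] * diff[:, None, :]
    grads = grad_f(points, w)                               # (n_ref, n_alpha, d)
    per_reference = diff * grads.mean(axis=1)               # (n_ref, d)
    return per_reference.mean(axis=0)                       # (d,)

def attribution_prior_penalty(attributions, mask):
    # Hypothetical prior: features flagged by mask == 1 are expected to be
    # unimportant, so their squared attributions are penalized.
    return float(np.sum((attributions * mask) ** 2))

# Usage: attributions for one input, plus a penalty that could be added to a
# task loss as task_loss + lam * penalty (lam is a hypothetical weight).
rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 0.5, 0.0])
attr = expected_gradients(x, background, w)
penalty = attribution_prior_penalty(attr, mask=np.array([0.0, 0.0, 1.0]))
```

In this toy setting the attributions sum to the difference between the model output at `x` and the mean output over the background set, and the feature with zero weight receives zero attribution, so a prior that penalizes attribution on that feature contributes nothing to the loss.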