On the Generalization of Gradient-based Neural Network Interpretations

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: interpretability, generalization, robustness, explainable AI
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Feature saliency maps are commonly used for interpreting neural network predictions. This approach to interpretability is often studied as a post-processing problem independent of the training setup, where the gradients of trained models are used to explain their output predictions. However, in this work, we observe that gradient-based interpretation methods are highly sensitive to the training set: models trained on disjoint datasets without regularization produce inconsistent interpretations across test data. Our numerical observations raise the question of how many training samples are required for accurate gradient-based interpretations. To address this question, we study the generalization aspect of gradient-based explanation schemes and show that proper generalization of interpretations from training samples to test data requires more training data than standard deep supervised learning problems. We prove generalization error bounds for widely-used gradient-based interpretations, suggesting that the sample complexity of interpretable deep learning is greater than that of standard deep learning. Our bounds also indicate that Gaussian smoothing in the widely-used SmoothGrad method acts as a regularization mechanism that reduces the generalization gap. We evaluate our findings on various neural network architectures and datasets to shed light on how the training data affect the generalization of interpretation methods.
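To make the two interpretation schemes discussed in the abstract concrete, the following is a minimal sketch (not the authors' code) of plain gradient saliency and SmoothGrad, which averages gradient maps over Gaussian-perturbed inputs; the model, hyperparameters, and helper names are illustrative assumptions.

```python
# Illustrative sketch only: plain gradient saliency vs. SmoothGrad.
# The Gaussian noise scale `sigma` corresponds to the smoothing that the
# abstract frames as a regularization mechanism for the generalization gap.
import torch

def gradient_saliency(model, x, target_class):
    """Gradient of the target-class logit w.r.t. the input (simple saliency map)."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    logits[0, target_class].backward()
    return x.grad.detach().abs()

def smoothgrad_saliency(model, x, target_class, n_samples=25, sigma=0.1):
    """SmoothGrad: average gradient saliency over Gaussian-perturbed copies of the input."""
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)
        total += gradient_saliency(model, noisy, target_class)
    return total / n_samples
```

Larger `sigma` and more noise samples smooth the saliency map more aggressively; the paper's claim is that this smoothing also makes the resulting interpretations less sensitive to which training set the model was fit on.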
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7141