Toward $\textbf{F}$aithfulness-guided $\textbf{E}$nsemble $\textbf{I}$nterpretation of Neural Networks

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: feature attribution, interpretability
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Interpretable and faithful explanations of specific neural inferences are essential for understanding and evaluating model behavior. For this purpose, feature attributions are highly favored for their interpretability. To objectively quantify the faithfulness of an attribution to the model, a widely used class of metrics perturbs the input by masking either the highly salient or the highly non-salient features. These metrics, however, neglect the faithfulness of the attribution to the hidden-layer encodings of the model, and hence ignore its internal structure. In response, we propose a novel attribution method, $\textbf{FEI}$, which targets faithfulness to hidden-layer representations. Moreover, the method optimizes the quality of the attribution under the perturbation metrics using a novel smooth approximation of those metrics, which enables effective optimization by gradient descent and improves performance on faithfulness evaluations. The method provides enhanced qualitative interpretability while also achieving superior scores on quantitative faithfulness measurements.
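The smooth-approximation idea described in the abstract can be illustrated with a short sketch: hard perturbation metrics score a binary top-$k$ mask, which is not differentiable, so one instead relaxes the mask to values in $[0,1]$ and runs gradient descent on a perturbation-style objective. The sketch below is a minimal illustration under assumed names, not the paper's actual FEI objective (which also targets hidden-layer representations and is not specified in this page): `model` is assumed to be a PyTorch classifier returning logits of shape `(1, num_classes)`, `x` an input of shape `(1, C, H, W)`, and `target` a class index.

```python
import torch

def optimize_mask(model, x, target, steps=200, lr=0.1, sparsity=0.05):
    """Optimize a soft attribution mask against a smoothed perturbation objective.

    Hypothetical illustration: keeping the masked-in region should preserve the
    target logit, deleting it should suppress the logit, and a sparsity term
    keeps the mask compact.
    """
    # One mask channel, broadcast across color channels.
    mask_logits = torch.zeros_like(x[:, :1]).requires_grad_(True)
    opt = torch.optim.Adam([mask_logits], lr=lr)
    baseline = torch.zeros_like(x)  # "removed" features are replaced by zeros

    for _ in range(steps):
        m = torch.sigmoid(mask_logits)          # soft mask in [0, 1]
        kept = m * x + (1 - m) * baseline       # retain only salient features
        dropped = (1 - m) * x + m * baseline    # delete salient features
        keep_score = model(kept)[0, target]     # should stay high
        drop_score = model(dropped)[0, target]  # should fall
        loss = -keep_score + drop_score + sparsity * m.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logits).detach()
```

Because the sigmoid mask is continuous, the insertion/deletion-style objective is differentiable end to end; a hidden-layer variant in the spirit of FEI would additionally compare intermediate activations of `kept` and `x`, but that term is omitted here since the page does not specify it.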
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3961