Identifying regularization schemes that make feature attributions faithful

Published: 25 Oct 2023, Last Modified: 10 Dec 2023, AI4D3 2023 Poster
Keywords: feature attributions, regularization, faithfulness
TL;DR: Independent noise corruption during training is effective for making input-gradient explanations faithful.
Abstract: Feature attribution methods assign a score to each input dimension as a measure of that dimension's relevance to a model's output. Despite wide use, the feature importance rankings induced by gradient-based feature attributions are unfaithful; that is, they do not correlate with the model's sensitivity to input perturbations---unless the model is trained to be adversarially robust. Here we demonstrate that these concerns carry over to models trained for protein function prediction tasks. Although adversarial training makes a model's gradient-based attributions faithful to the model, it yields low real-data performance. We find that independent Gaussian noise corruption is an effective alternative to adversarial training, conferring faithfulness on a model's gradient-based attributions without performance degradation. In contrast, we observe no meaningful faithfulness benefit from regularization schemes such as dropout and weight decay. We translate these insights to a real-world protein function prediction task, where the gradient-based feature attributions of noise-regularized models correctly indicate low sensitivity to irrelevant gap tokens in a protein's sequence alignment.
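The abstract names two concrete ingredients: training under independent Gaussian input corruption, and input-gradient feature attributions. The sketch below illustrates both in PyTorch; it is a minimal illustration under assumed names (`model`, `optimizer`, `x`, `y`, and the noise scale `sigma` are placeholders), not the paper's implementation.

```python
# Minimal sketch of the two ideas described in the abstract (assumed
# PyTorch setup; all names and hyperparameters are illustrative).
import torch
import torch.nn as nn

def train_step(model, optimizer, x, y, sigma=0.1):
    """One training step with independent Gaussian input corruption."""
    model.train()
    optimizer.zero_grad()
    # Corrupt each input dimension with i.i.d. Gaussian noise.
    x_noisy = x + sigma * torch.randn_like(x)
    loss = nn.functional.cross_entropy(model(x_noisy), y)
    loss.backward()
    optimizer.step()
    return loss.item()

def input_gradient_attribution(model, x, target):
    """Gradient-based attribution: d(output[target]) / d(input)."""
    model.eval()
    x = x.clone().requires_grad_(True)
    score = model(x)[:, target].sum()
    score.backward()
    # One relevance score per input dimension.
    return x.grad.detach()
```

The faithfulness question the paper studies is whether the scores returned by `input_gradient_attribution` track the model's actual sensitivity to perturbations of each input dimension; the claim is that training with `train_step`-style noise corruption makes them do so without the performance cost of adversarial training.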
Submission Number: 79