Abstract: Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and, most importantly, their faithfulness. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a FACADE model that overwhelms the gradients without affecting the predictions. This FACADE model can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (text classification, NLI, and QA), we show that our method can manipulate numerous gradient-based analysis techniques: saliency maps, input reduction, and adversarial perturbations all identify unimportant or targeted tokens as being highly important. The code and a tutorial for this paper are available at http://ucinlp.github.io/facade.
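To make the core idea concrete, here is a minimal, hypothetical PyTorch sketch of how a facade can dominate gradients without changing predictions. It is not the authors' exact layer-merging construction (see the linked code for that); instead, it merges the two models at the logit level, and `FacadeWrapper`, `eps`, and `grad_scale` are illustrative names. The trick is to shrink the facade's forward contribution (so the argmax prediction is unchanged) while amplifying its backward signal (so input gradients, and hence saliency maps, come almost entirely from the facade).

```python
import torch
import torch.nn as nn

class FacadeWrapper(nn.Module):
    """Hypothetical logit-level merge of a target model and a facade model.

    Assumes both models map input embeddings to class logits. `eps` keeps
    the facade's forward contribution too small to flip predictions;
    `grad_scale` amplifies its gradients so they dominate saliency.
    """

    def __init__(self, target_model, facade_model, eps=1e-3, grad_scale=1e3):
        super().__init__()
        self.target = target_model   # the model being "interpreted"
        self.facade = facade_model   # trained to have misleading gradients
        self.eps = eps
        self.grad_scale = grad_scale

    def forward(self, embeds):
        target_logits = self.target(embeds)
        facade_logits = self.facade(embeds)
        # Straight-through-style rescaling: the forward value is
        # eps * facade_logits, but the backward pass sees gradients scaled
        # by eps * grad_scale, letting the facade overwhelm the gradients.
        amplified = self.grad_scale * facade_logits
        facade_out = self.eps * (
            amplified - amplified.detach() + facade_logits.detach()
        )
        return target_logits + facade_out
```

Under these assumptions, a standard gradient saliency computation on the merged model is then dominated by the facade's misleading gradients, e.g.:

```python
# merged = FacadeWrapper(target, facade)
# embeds = embeds.requires_grad_(True)
# logits = merged(embeds)
# logits[0, logits[0].argmax()].backward()
# saliency = embeds.grad.norm(dim=-1)  # reflects the facade, not the target
```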