Identifying the Source of Vulnerability in Fragile Interpretations: A Case Study in Neural Text Classification
Abstract: Prior work has mainly used input perturbation to test the stability of post-hoc interpretation methods, and has observed fragile interpretations. However, different studies report conflicting results on the primary source of this fragility, because input perturbation can affect both the model and the interpretation method. This work instead proposes a simple output perturbation method that circumvents the model's potential effects by slightly modifying the prediction probability. We evaluate the proposed method using two widely used post-hoc interpretation methods (LIME and Sample Shapley), with CNN, LSTM, and BERT as the neural classifiers. The results show that post-hoc methods produce only slightly different interpretations under output perturbation. This suggests that the black-box model, rather than the interpretation method, is the primary source of fragile interpretations.
Paper Type: short
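The output perturbation idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the prediction probability is modified, so the additive Gaussian noise, the clipping, and the renormalization below are assumptions, and the names `perturb_output`, `toy_predict_proba`, and `sigma` are hypothetical. The wrapper leaves the model and its inputs untouched and only alters the probabilities that a post-hoc explainer such as LIME or Sample Shapley would query.

```python
import numpy as np


def perturb_output(predict_proba, sigma=0.01, seed=0):
    """Wrap a black-box probability function with small output noise.

    Assumption: the perturbation is additive Gaussian noise followed by
    renormalization; the paper's exact scheme may differ.
    """
    rng = np.random.default_rng(seed)

    def wrapped(inputs):
        probs = np.asarray(predict_proba(inputs), dtype=float)
        # Add small noise to each class probability.
        noisy = probs + rng.normal(0.0, sigma, size=probs.shape)
        # Keep values positive, then renormalize each row to sum to 1.
        noisy = np.clip(noisy, 1e-12, None)
        return noisy / noisy.sum(axis=-1, keepdims=True)

    return wrapped


def toy_predict_proba(inputs):
    """Hypothetical stand-in for a trained CNN, LSTM, or BERT classifier."""
    scores = np.array([[len(text) % 3, 1.0] for text in inputs], dtype=float)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


texts = ["a good movie", "dull"]
noisy_predict = perturb_output(toy_predict_proba, sigma=0.01)
clean = toy_predict_proba(texts)
noisy = noisy_predict(texts)
# Largest change in any class probability caused by the perturbation.
max_shift = float(np.abs(noisy - clean).max())
```

Under these assumptions, the same explainer would be run once against `toy_predict_proba` and once against `noisy_predict`, and the two sets of feature attributions compared. Because the model itself is unchanged, any difference between them would be attributable to the interpretation method alone.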