Abstract: In this work, we investigate whether neural network interpreters can be fooled through adversarial model manipulation. We conduct rigorous and extensive experiments with several interpretation methods, e.g., LRP and Grad-CAM, and discuss the hyperparameters applied to the models and the fooling procedures. Our results are validated by comparing the visual interpretations before and after the fooling and by reporting quantitative metrics that measure the deviation from the original interpretations. The work shows that a model can be adversarially manipulated so that the interpreted cause of its predictions changes without the manipulation being noticed. We believe this work can facilitate the development of more robust and reliable neural network interpreters that truly reflect the network's underlying decision-making process.
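As a minimal sketch of what "quantitative metrics that measure the deviation from the original interpretations" could look like, the snippet below compares an interpretation heatmap (e.g., from Grad-CAM or LRP) before and after fooling via a rank correlation of pixel importances. The metric and helper names are illustrative assumptions, not necessarily those used in the report.

```python
# Illustrative only: compare interpretation heatmaps before and after fooling.
# The rank-correlation metric here is an assumed choice for measuring deviation.
import numpy as np

def normalize(heatmap: np.ndarray) -> np.ndarray:
    """Scale a saliency/relevance map to [0, 1] so maps are comparable."""
    h = heatmap.astype(np.float64)
    h -= h.min()
    rng = h.max()
    return h / rng if rng > 0 else h

def rank_correlation(before: np.ndarray, after: np.ndarray) -> float:
    """Spearman rank correlation between two heatmaps of the same shape.
    Values near 1 mean the interpretation is unchanged; low or negative
    values indicate the fooling shifted the attributed evidence."""
    b = normalize(before).ravel()
    a = normalize(after).ravel()
    # Ranks via double argsort (no tie handling; fine for continuous maps).
    b_ranks = np.argsort(np.argsort(b)).astype(np.float64)
    a_ranks = np.argsort(np.argsort(a)).astype(np.float64)
    b_ranks -= b_ranks.mean()
    a_ranks -= a_ranks.mean()
    denom = np.sqrt((b_ranks ** 2).sum() * (a_ranks ** 2).sum())
    return float((b_ranks * a_ranks).sum() / denom) if denom > 0 else 1.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.random((14, 14))      # stand-in for a heatmap from the original model
    fooled = np.flip(original, axis=1)   # stand-in for the heatmap after manipulation
    print(f"rank correlation: {rank_correlation(original, fooled):.3f}")
```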
Track: Ablation
NeurIPS Paper Id: https://openreview.net/forum?id=HJxxPEHgUB&noteId=HketC3qd9S