Abstract: In this work, we investigate whether neural network interpreters can be fooled through adversarial model manipulation. We conduct rigorous and extensive experiments with several interpretation methods, e.g., LRP and Grad-CAM, and discuss the hyperparameters applied to the models and the fooling procedures. Our results are validated by comparing the visual interpretations before and after the fooling and by reporting quantitative metrics that measure the deviation from the original interpretations. The work shows that a model can be adversarially manipulated so that the interpreted cause of its predictions changes without the manipulation being noticed. We believe this work can facilitate the development of more robust and reliable neural network interpreters that truly reflect the network's underlying decision-making process.
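As a minimal sketch of what "quantitative metrics that measure the deviation from the original interpretations" could look like, the snippet below compares an interpretation heatmap (e.g., from Grad-CAM or LRP) before and after fooling via a rank correlation of pixel importances. The metric and helper names are illustrative assumptions, not necessarily those used in the report.

```python
# Illustrative only: compare interpretation heatmaps before and after fooling.
# The rank-correlation metric here is an assumed choice for measuring deviation.
import numpy as np

def normalize(heatmap: np.ndarray) -> np.ndarray:
    """Scale a saliency/relevance map to [0, 1] so maps are comparable."""
    h = heatmap.astype(np.float64)
    h -= h.min()
    rng = h.max()
    return h / rng if rng > 0 else h

def rank_correlation(before: np.ndarray, after: np.ndarray) -> float:
    """Spearman rank correlation between two heatmaps of the same shape.
    Values near 1 mean the interpretation is unchanged; low or negative
    values indicate the fooling shifted the attributed evidence."""
    b = normalize(before).ravel()
    a = normalize(after).ravel()
    # Ranks via double argsort (no tie handling; fine for continuous maps).
    b_ranks = np.argsort(np.argsort(b)).astype(np.float64)
    a_ranks = np.argsort(np.argsort(a)).astype(np.float64)
    b_ranks -= b_ranks.mean()
    a_ranks -= a_ranks.mean()
    denom = np.sqrt((b_ranks ** 2).sum() * (a_ranks ** 2).sum())
    return float((b_ranks * a_ranks).sum() / denom) if denom > 0 else 1.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.random((14, 14))      # stand-in for a heatmap from the original model
    fooled = np.flip(original, axis=1)   # stand-in for the heatmap after manipulation
    print(f"rank correlation: {rank_correlation(original, fooled):.3f}")
```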
Track: Ablation
NeurIPS Paper Id: https://openreview.net/forum?id=HJxxPEHgUB&noteId=HketC3qd9S