Adversarial Attacks on Feature Visualization Methods

Jonathan Marty; Eugene Belilovsky; Michael Eickenberg

Published: 05 Dec 2022, Last Modified: 05 May 2023 | MLSW2022 | Readers: Everyone

Abstract: The internal functional behavior of trained Deep Neural Networks is notoriously difficult to interpret. Feature visualization approaches are one set of techniques used to interpret and analyze trained deep learning models. However, interpretability methods may themselves be susceptible to deception. In particular, we consider an adversary who manipulates a model in order to deceive its interpretation. Focusing on the popular feature visualizations associated with CNNs, we introduce an optimization framework for modifying the outcome of feature visualization methods.
