Adversarial Attacks on Neuron Interpretation via Activation Maximization

NeurIPS 2023 Workshop ATTRIB, Submission 42 Authors

Published: 27 Oct 2023, Last Modified: 08 Dec 2023 | ATTRIB Poster
Keywords: interpretability, feature visualization, adversarial model manipulation
Abstract: Feature visualization is one of the most popular techniques to interpret the internal behavior of individual units of trained deep neural networks. Based on activation maximization, they consist of finding $\textit{synthetic}$ or $\textit{natural}$ inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. It consists of fine-tuning a pre-trained model with a specifically introduced loss that aims to maintain model performance, while also significantly changing feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the ImageNet classification task.
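For readers unfamiliar with the attacked technique, the following is a minimal sketch of activation maximization itself (not the paper's manipulation loss): starting from a random input, gradient ascent is used to find a synthetic image that maximizes a chosen unit's activation. The tiny convolutional model, layer sizes, and hyperparameters here are hypothetical stand-ins; the paper targets ImageNet-pretrained networks.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained network (hypothetical; the paper
# uses ImageNet classifiers such as those in torchvision).
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # one scalar activation per unit
)

def activation_maximization(model, unit, x0, steps=100, lr=0.1):
    """Gradient-ascend a synthetic input that maximizes one unit's activation."""
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        act = model(x)[0, unit].mean()  # activation of the chosen unit
        (-act).backward()               # ascend by minimizing the negative
        opt.step()
    return x.detach(), act.item()

x0 = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    before = model(x0)[0, 0].mean().item()
x_opt, after = activation_maximization(model, unit=0, x0=x0)
print(f"activation before: {before:.3f}, after: {after:.3f}")
```

The adversarial manipulation described in the abstract then fine-tunes the model's weights so that this optimization converges to misleading visualizations while classification accuracy is preserved.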
Submission Number: 42