Don't trust your eyes: on the (un)reliability of feature visualizations

Published: 20 Jun 2023, Last Modified: 07 Aug 2023, AdvML-Frontiers 2023
Keywords: adversarial model manipulation, feature visualization, interpretability, explainability, deep learning, neural networks, analysis, theory, activation maximization
TL;DR: How reliable are feature visualizations? We investigate this question through the lens of an adversary, empirically, and theoretically. All three perspectives cast doubt on the reliability of feature visualizations (e.g., they can be manipulated).
Abstract: How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. We underpin this empirical finding with theory, proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include black-box neural networks.
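For context, feature visualization typically refers to activation maximization: optimizing an input image so that a chosen unit responds strongly. Below is a minimal sketch of that idea; the network, target layer, channel index, step count, and learning rate are illustrative assumptions rather than the paper's setup, and practical methods add image priors and regularizers on top of this.

```python
# Minimal sketch of feature visualization via activation maximization.
# Assumes torch and torchvision; model/layer/unit choices are hypothetical.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # stand-in network for illustration

# Capture activations of a chosen layer with a forward hook.
activations = {}
def hook(module, inp, out):
    activations["feat"] = out
handle = model.layer3.register_forward_hook(hook)  # hypothetical target layer

unit = 7  # hypothetical channel index to visualize
image = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    model(image)
    # Maximize the mean activation of the chosen channel (minimize its negative).
    loss = -activations["feat"][0, unit].mean()
    loss.backward()
    optimizer.step()

handle.remove()
# Crude post-processing; real visualization pipelines use stronger priors.
visualization = image.detach().clamp(0, 1)
```

The paper's manipulation results concern exactly this kind of optimization-based visualization: a circuit can be added to the network so that the optimized image shows an arbitrary pattern while behavior on natural inputs is unchanged.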
Submission Number: 32