Adversarial Examples Are Not Bugs, They Are Superposition

Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster
License: CC BY 4.0
Keywords: Understanding high-level properties of models
TL;DR: This paper presents theoretical and experimental evidence that superposition may be the fundamental mechanism underlying adversarial vulnerability in deep learning models.
Abstract: Adversarial examples—inputs with imperceptible perturbations that fool neural networks—remain one of deep learning's most perplexing phenomena despite nearly a decade of research. While numerous defenses and explanations have been proposed, there is no consensus on the fundamental mechanism. One underexplored hypothesis is that superposition, a concept from mechanistic interpretability, may be a major contributing factor, or even the primary cause. We present four lines of evidence in support of this hypothesis, greatly extending prior arguments by Elhage et al. (2022): (1) superposition can theoretically explain a range of adversarial phenomena, (2) in toy models, intervening on superposition controls robustness, (3) in toy models, intervening on robustness (via adversarial training) controls superposition, and (4) in ResNet18, intervening on robustness (via adversarial training) controls superposition.
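Evidence line (2) involves intervening on superposition in toy models and measuring robustness. Below is a minimal sketch of what such an experiment might look like, assuming the feature-compression toy model of Elhage et al. (2022) and an FGSM-style input perturbation; the `ToyModel` class, hyperparameters, and robustness measure are illustrative assumptions, not the paper's actual experimental code.

```python
# Minimal sketch (assumed setup): n sparse features compressed into m < n hidden
# dimensions, as in the Elhage et al. (2022) toy model of superposition, then
# evaluated under a one-step FGSM-style perturbation of the inputs.
import torch

n_features, n_hidden = 20, 5
sparsity = 0.95  # probability a feature is zero; higher sparsity -> more superposition

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):  # x: (batch, n_features); reconstruct via ReLU(W^T W x + b)
        return torch.relu(x @ self.W.T @ self.W + self.b)

def sample_batch(batch=1024):
    # Sparse feature activations: uniform values, most of them masked to zero.
    x = torch.rand(batch, n_features)
    mask = (torch.rand(batch, n_features) > sparsity).float()
    return x * mask

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(3000):  # train the model to reconstruct the sparse features
    x = sample_batch()
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def fgsm_loss(model, x, eps=0.1):
    # One signed-gradient step on the inputs that increases reconstruction loss.
    x_adv_seed = x.clone().requires_grad_(True)
    loss = ((model(x_adv_seed) - x) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, x_adv_seed)
    with torch.no_grad():
        x_adv = x + eps * grad.sign()
        return ((model(x_adv) - x) ** 2).mean().item()

with torch.no_grad():
    x = sample_batch()
    clean = ((model(x) - x) ** 2).mean().item()
print("clean loss:", clean)
print("FGSM  loss:", fgsm_loss(model, x))
# Sweeping `sparsity` (which controls how much superposition the trained model
# exhibits) and comparing clean vs. perturbed loss gives the kind of
# superposition-vs-robustness intervention described in evidence line (2).
```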
Submission Number: 92