Adversarial Attacks Leverage Interference Between Features in Superposition

Published: 30 Sept 2025, Last Modified: 30 Sept 2025, Mech Interp Workshop (NeurIPS 2025) Spotlight, CC BY 4.0
Keywords: Foundational work, Understanding high-level properties of models
Other Keywords: superposition, adversarial vulnerability
TL;DR: This paper shows that adversarial attacks exploit the geometric arrangements of superposed features in neural network representations, revealing a tradeoff between representation capacity and adversarial vulnerability.
Abstract: Fundamental questions remain about why adversarial examples arise in neural networks. In this paper, we demonstrate that adversarial vulnerability can emerge from feature superposition, in which networks represent more latent features than they have dimensions. Through controlled experiments on toy models and vision transformers (ViTs), we show how data properties induce specific superposition geometries that adversaries systematically exploit. We show that adversarial perturbations exploit interference patterns between superposed features, with the geometric arrangement of these features determining attack characteristics. Our framework provides a mechanistic explanation for two known phenomena: the transferability of adversarial attacks between models with similar training regimes, and class-specific vulnerability. These findings persist beyond toy settings, as we show with ViTs trained on CIFAR-10 with an engineered bottleneck. These results show that adversarial vulnerability can stem from efficient information encoding in neural networks, rather than from flaws in the learning process or from non-robust input features.
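
Below is a minimal, hypothetical sketch (not the paper's code) of the mechanism the abstract describes: when a toy model embeds more features than it has hidden dimensions, the off-diagonal entries of W^T W act as interference terms between superposed features, and a small perturbation along one feature's direction shifts the readout of another feature by exactly that interference term. All shapes, values, and variable names here are illustrative assumptions.

```python
# Minimal sketch (illustrative only, not the paper's setup): superposition in a
# linear toy model and the interference an adversary could exploit.
import numpy as np

rng = np.random.default_rng(0)

n_features, n_dims = 6, 2            # more features than hidden dimensions -> superposition
W = rng.normal(size=(n_dims, n_features))
W /= np.linalg.norm(W, axis=0)       # unit-norm feature directions in the hidden space

# Interference matrix: off-diagonal entries of W^T W measure how much each pair
# of superposed features overlaps (reads into each other).
interference = W.T @ W
print(np.round(interference, 2))

# Input with only feature 0 active, mapped to the hidden state h = W x.
x = np.zeros(n_features)
x[0] = 1.0
h = W @ x

# Perturb the hidden state along feature k's direction; the readout of feature j
# shifts by eps * interference[k, j], even though feature j was never touched.
j, k = 0, 1
eps = 0.5
delta = eps * W[:, k]
readout_clean = (W.T @ h)[j]
readout_attacked = (W.T @ (h + delta))[j]
print(readout_clean, readout_attacked, readout_attacked - readout_clean)
```

Under these assumptions, packing more features into the same number of dimensions forces larger off-diagonal interference terms, which is one way to picture the tradeoff between representation capacity and adversarial vulnerability highlighted in the TL;DR.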
Submission Number: 137