Measuring Superposition with Sparse Autoencoders — Does Superposition Cause Adversarial Vulnerability?
Abstract: Neural networks achieve remarkable performance through \textit{superposition}—encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This phenomenon fundamentally challenges interpretability: when neurons respond to multiple unrelated concepts, understanding network behavior becomes intractable. Yet despite its central importance, we lack principled methods for measuring superposition. We present an information-theoretic framework that measures the effective number of features as the exponential of the Shannon entropy of sparse autoencoder activations. This threshold-free metric, grounded in rate-distortion theory and an analogy to quantum entanglement, provides the first universal measure of superposition applicable to any neural network.
Our approach shows strong empirical validation: its correlation with ground truth exceeds 0.94 in toy models, it accurately detects minimal superposition in algorithmic tasks (feature count approximately equal to neuron count), and it reveals systematic feature reduction under capacity constraints (up to a 50\% reduction with dropout). Layer-wise analysis of Pythia-70M shows that feature counts peak in early-middle layers at 20 times the number of neurons before declining—mirroring patterns observed in intrinsic dimensionality studies. The metric also captures developmental dynamics, detecting sharp reorganization during grokking phase transitions, where models shift from superposed memorization to compact algorithmic solutions.
Surprisingly, adversarial training can increase feature counts by up to 4× while improving robustness, contradicting the hypothesis that superposition causes vulnerability. The effect depends on task complexity and network capacity: simple tasks and ample capacity enable feature expansion, while complex tasks or limited capacity force feature reduction.
By providing a principled, threshold-free measure of superposition, this work enables quantitative study of neural information organization.
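The entropy-based metric described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: how activations are aggregated into a probability distribution is an assumption here (we use each feature's mean absolute activation across samples, normalized to sum to one), and the function name is hypothetical.

```python
import numpy as np

def effective_feature_count(sae_activations: np.ndarray) -> float:
    """Effective number of features: exp(Shannon entropy).

    `sae_activations` is a (num_samples, num_features) array of
    sparse autoencoder feature activations. Assumption: each
    feature's mean absolute activation, normalized to a probability
    distribution, is the mass over which entropy is taken.
    """
    mass = np.abs(sae_activations).mean(axis=0)
    p = mass / mass.sum()
    nz = p[p > 0]  # treat 0 * log(0) as 0
    entropy = -np.sum(nz * np.log(nz))  # entropy in nats
    return float(np.exp(entropy))
```

The exponential of entropy (a perplexity) is what makes the measure threshold-free: a uniform spread over k features yields exactly k, while a single dominant feature yields 1, with no sparsity cutoff to tune.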
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 5627