Measuring Superposition with Sparse Autoencoders — Does Superposition Cause Adversarial Vulnerability?
Abstract: Neural networks achieve remarkable performance through \textit{superposition}—encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This phenomenon fundamentally challenges interpretability: when neurons respond to multiple unrelated concepts, understanding network behavior becomes intractable. Yet despite its central importance, we lack principled methods for measuring superposition. We present an information-theoretic framework that measures the effective number of features as the exponential of the Shannon entropy of sparse autoencoder activations. This threshold-free metric, grounded in rate-distortion theory and an analogy to quantum entanglement, provides the first universal measure of superposition applicable to any neural network.
Our approach shows strong empirical validation: its correlation with ground truth exceeds 0.94 in toy models, it accurately detects minimal superposition in algorithmic tasks (feature count approximately equal to neuron count), and it reveals systematic feature reduction under capacity constraints (up to a 50\% reduction with dropout). Layer-wise analysis of Pythia-70M shows that feature counts peak in early-middle layers at 20 times the number of neurons before declining—mirroring patterns observed in intrinsic dimensionality studies. The metric also captures developmental dynamics, detecting sharp reorganization during grokking phase transitions, where models shift from superposed memorization to compact algorithmic solutions.
Surprisingly, adversarial training can increase feature counts by up to 4× while improving robustness, contradicting the hypothesis that superposition causes vulnerability. The effect depends on task complexity and network capacity: simple tasks and ample capacity enable feature expansion, while complex tasks or limited capacity force feature reduction.
By providing a principled, threshold-free measure of superposition, this work enables quantitative study of neural information organization.
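The entropy-based metric described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: how activations are aggregated into a probability distribution is an assumption here (we use each feature's mean absolute activation across samples, normalized to sum to one), and the function name is hypothetical.

```python
import numpy as np

def effective_feature_count(sae_activations: np.ndarray) -> float:
    """Effective number of features: exp(Shannon entropy).

    `sae_activations` is a (num_samples, num_features) array of
    sparse autoencoder feature activations. Assumption: each
    feature's mean absolute activation, normalized to a probability
    distribution, is the mass over which entropy is taken.
    """
    mass = np.abs(sae_activations).mean(axis=0)
    p = mass / mass.sum()
    nz = p[p > 0]  # treat 0 * log(0) as 0
    entropy = -np.sum(nz * np.log(nz))  # entropy in nats
    return float(np.exp(entropy))
```

The exponential of entropy (a perplexity) is what makes the measure threshold-free: a uniform spread over k features yields exactly k, while a single dominant feature yields 1, with no sparsity cutoff to tune.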
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 5627