Abstract: _Sparse autoencoders_ have been used to interpret activity inside large language models as "superposition codes" for sparse, high-dimensional signals. The encoder layers of these autoencoders use simple methods, which we will call "one-step estimates," to read latent sparse signals from vectors of hidden neuron activations. This work investigates the reliability of one-step estimates on a generic family of sparse inference problems. We show that these estimates are remarkably inefficient from the point of view of coding theory: even in a "very sparse" regime, they are only reliable when the code spends at least $2.7$ dimensions per bit of entropy in the latent signal. In comparison, a very naive iterative method called matching pursuit can read superposition codes given just $1.3$ dimensions per bit. This raises the question of whether neural networks can achieve similar bitrates in their internal representations.
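To make the two decoders concrete, here is a minimal synthetic sketch (not the paper's code or exact problem setup): a "superposition code" $y = Ds$ with a random unit-norm dictionary $D$ and a $k$-sparse latent $s$, read out once by thresholding correlations (roughly what a tied-weight sparse-autoencoder encoder computes) and then by plain matching pursuit. The dimensions `n, d, k`, the $\pm 1$-valued latent, and the support-size stopping rule are illustrative assumptions.

```python
# Hedged illustration: one-step readout vs. matching pursuit on a toy superposition code.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 300, 10          # latent features, code dimension, active features (assumed)

# Random unit-norm dictionary and a k-sparse +/-1 latent signal.
D = rng.standard_normal((d, n))
D /= np.linalg.norm(D, axis=0)
s = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
s[support] = rng.choice([-1.0, 1.0], size=k)
y = D @ s                        # the observed "hidden activations"

# (1) One-step estimate: correlate once with the dictionary and keep the k largest entries.
corr = D.T @ y
one_step_support = np.argsort(np.abs(corr))[-k:]

# (2) Matching pursuit: iteratively peel the best-matching atom off the residual.
def matching_pursuit(y, D, n_steps):
    r = y.copy()
    s_hat = np.zeros(D.shape[1])
    for _ in range(n_steps):
        j = np.argmax(np.abs(D.T @ r))   # atom most correlated with the residual
        c = D[:, j] @ r                  # its coefficient against the residual
        s_hat[j] += c
        r -= c * D[:, j]
    return s_hat

mp_support = np.argsort(np.abs(matching_pursuit(y, D, n_steps=k)))[-k:]

true = set(support)
print("one-step recovered:", len(true & set(one_step_support)), "of", k, "active features")
print("matching pursuit recovered:", len(true & set(mp_support)), "of", k, "active features")
```

Shrinking `d` relative to `k` (i.e., spending fewer dimensions per bit of latent entropy) is where the gap the abstract describes should appear, with the one-step readout failing before matching pursuit does.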
Primary Area: Theory->Deep Learning
Keywords: sparse autoencoders, coding theory, information, superposition codes, interpretability, mechanistic interpretability
Submission Number: 12246