Abstract: _Sparse autoencoders_ have been used to interpret activity inside large language models as "superposition codes" for sparse, high-dimensional signals. The encoder layers of these autoencoders use simple methods, which we will call "one-step estimates," to read latent sparse signals from vectors of hidden neuron activations. This work investigates the reliability of one-step estimates on a generic family of sparse inference problems. We show that these estimates are remarkably inefficient from the point of view of coding theory: even in a "very sparse" regime, they are only reliable when the code spends at least $2.7$ dimensions per bit of entropy in the latent signal. In comparison, a very naive iterative method called matching pursuit can read superposition codes given just $1.3$ dimensions per bit. This raises the question of whether neural networks can achieve similar bitrates in their internal representations.
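To make the two decoders concrete, here is a minimal synthetic sketch (not the paper's code or exact problem setup): a "superposition code" $y = Ds$ with a random unit-norm dictionary $D$ and a $k$-sparse latent $s$, read out once by thresholding correlations (roughly what a tied-weight sparse-autoencoder encoder computes) and then by plain matching pursuit. The dimensions `n, d, k`, the $\pm 1$-valued latent, and the support-size stopping rule are illustrative assumptions.

```python
# Hedged illustration: one-step readout vs. matching pursuit on a toy superposition code.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 300, 10          # latent features, code dimension, active features (assumed)

# Random unit-norm dictionary and a k-sparse +/-1 latent signal.
D = rng.standard_normal((d, n))
D /= np.linalg.norm(D, axis=0)
s = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
s[support] = rng.choice([-1.0, 1.0], size=k)
y = D @ s                        # the observed "hidden activations"

# (1) One-step estimate: correlate once with the dictionary and keep the k largest entries.
corr = D.T @ y
one_step_support = np.argsort(np.abs(corr))[-k:]

# (2) Matching pursuit: iteratively peel the best-matching atom off the residual.
def matching_pursuit(y, D, n_steps):
    r = y.copy()
    s_hat = np.zeros(D.shape[1])
    for _ in range(n_steps):
        j = np.argmax(np.abs(D.T @ r))   # atom most correlated with the residual
        c = D[:, j] @ r                  # its coefficient against the residual
        s_hat[j] += c
        r -= c * D[:, j]
    return s_hat

mp_support = np.argsort(np.abs(matching_pursuit(y, D, n_steps=k)))[-k:]

true = set(support)
print("one-step recovered:", len(true & set(one_step_support)), "of", k, "active features")
print("matching pursuit recovered:", len(true & set(mp_support)), "of", k, "active features")
```

Shrinking `d` relative to `k` (i.e., spending fewer dimensions per bit of latent entropy) is where the gap the abstract describes should appear, with the one-step readout failing before matching pursuit does.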
Primary Area: Theory->Deep Learning
Keywords: sparse autoencoders, coding theory, information, superposition codes, interpretability, mechanistic interpretability
Submission Number: 12246