Abstract: Recent work in neural network interpretability has suggested that the hidden activations of some deep models can be viewed as linear projections of much higher-dimensional sparse vectors of latent ``features.'' In general, this kind of representation is known as a superposition code. This work presents an information-theoretic account of superposition codes in a setting applicable to interpretability. We show that when the number $k$ of active features is very small compared to the total number $N$ of features, simple inference methods currently used by sparse autoencoders can reliably decode a $d$-dimensional superposition code when $d$ is only a constant factor above the Shannon limit. Specifically, when $\ln k / \ln N \le \eta < 1$ and $H$ is the entropy of the latent vector in bits, it asymptotically suffices that $d / H > C(\eta)$ for certain increasing functions $C(\eta)$. However, the behavior of $C(\eta)$ depends on which decoding method is used. For example, when $\eta = 0.3$, we show empirically that a method based on the popular top-$k$ activation function typically requires roughly $C = 4$ dimensions per bit. On the other hand, we exhibit an algorithm that succeeds with fewer than $2$ dimensions per bit and requires only around $3$ times as many FLOPs for the same values of $(N, d)$. We hope this work helps connect research in interpretability with perspectives from compressive sensing and information theory.
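To make the decoding setup in the abstract concrete, below is a minimal illustrative sketch (not the submission's code): a $k$-sparse binary latent vector over $N$ features is encoded into $d$ dimensions by a random Gaussian dictionary, and decoded by keeping the $k$ largest correlations with the dictionary columns, in the spirit of the top-$k$ activation function mentioned above. The sizes `N = 10_000`, `d = 256`, `k = 3` and the NumPy implementation are assumptions chosen only for illustration, not values from the paper.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's code):
# encode a k-sparse binary latent vector z in R^N into d dimensions
# with a random dictionary W, then decode by keeping the top-k
# entries of W^T x, analogous to a top-k sparse autoencoder.
rng = np.random.default_rng(0)
N, d, k = 10_000, 256, 3                        # hypothetical sizes

W = rng.standard_normal((d, N)) / np.sqrt(d)    # random Gaussian dictionary, roughly unit-norm columns
support = rng.choice(N, size=k, replace=False)  # which k features are active
z = np.zeros(N)
z[support] = 1.0

x = W @ z                                       # d-dimensional superposition code

scores = W.T @ x                                # correlate x with every feature direction
decoded = np.argsort(scores)[-k:]               # top-k activation as the decoder

print("true support:   ", sorted(support))
print("decoded support:", sorted(decoded))
print("exact recovery: ", set(support) == set(decoded))
```

With $k \ll N$ and $d$ large enough, the cross-feature interference in `scores` stays well below the on-feature signal, so the top-$k$ rule typically recovers the support exactly; shrinking `d` makes recovery fail, which is the regime the abstract's constants $C(\eta)$ quantify.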
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Antoine_Patrick_Isabelle_Eric_Ledent1
Submission Number: 7480