Measuring Feature Sparsity in Language Models

Published: 23 Oct 2023, Last Modified: 28 Nov 2023. SoLaR Spotlight.
Keywords: sparse coding, sparse dictionary learning, interpretability, language models, superposition, polysemanticity, metrics
TL;DR: We devise metrics to measure the success of sparse coding techniques on language model activations and use them to assess the extent to which activations can be accurately represented as sparse linear combinations of feature vectors.
Abstract: Recent works have proposed that activations in language models can be modelled as sparse linear combinations of vectors corresponding to features of input text. Under this assumption, these works aimed to reconstruct feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and test the validity of the linearity and sparsity assumptions. We show our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish between sparse linear data and several other distributions. We use our metrics to measure levels of sparsity in several language models. We find evidence that language model activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.
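The modelling assumption in the abstract — that each activation vector is a sparse linear combination of a fixed set of feature directions — can be illustrated on synthetic data. The sketch below is illustrative only and is not the authors' method: it generates synthetic "activations" from a hypothetical feature dictionary, then recovers sparse codes with a simple orthogonal matching pursuit (the dictionary dimensions, sparsity level, and the OMP recovery step are all assumptions for the demo).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, s, n = 64, 96, 4, 100  # assumed: activation dim, number of features, nonzeros per sample, samples

# Hypothetical overcomplete feature dictionary with unit-norm columns.
D = rng.normal(size=(d, k))
D /= np.linalg.norm(D, axis=0)

# Synthetic "activations": each is a sparse linear combination of s features.
X = np.zeros((n, d))
for i in range(n):
    idx = rng.choice(k, size=s, replace=False)
    X[i] = D[:, idx] @ rng.normal(size=s)

def omp(x, D, s):
    """Greedy orthogonal matching pursuit: recover an s-sparse code for x."""
    residual, support = x.copy(), []
    for _ in range(s):
        # Pick the atom most correlated with the current residual.
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        # Re-fit coefficients on the selected atoms and update the residual.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    code = np.zeros(D.shape[1])
    code[support] = coef
    return code

codes = np.array([omp(x, D, s) for x in X])
recon_err = np.linalg.norm(codes @ D.T - X) / np.linalg.norm(X)
print(recon_err)  # small when the data really is sparse and linear in D
```

A near-zero relative reconstruction error at a small sparsity level is the signature of sparse linear structure; the paper's metrics aim to quantify this on real language model activations, where the true dictionary is unknown and must itself be learned by sparse coding.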
Submission Number: 38