Measuring Feature Sparsity in Language Models

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: sparse coding, sparse dictionary learning, interpretability, language models, superposition, polysemanticity, metrics
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We devise metrics to measure the success of sparse coding techniques on language model activations and use them to assess the extent to which activations can be accurately represented as sparse linear combinations of feature vectors.
Abstract: Recent works have proposed that intermediate activations in language models can be modelled as sparse linear combinations of vectors corresponding to features of the input text. Under this assumption, these works have aimed to reconstruct the feature directions using sparse coding. We develop metrics that assess the success of these sparse coding techniques and thereby implicitly test the validity of the linearity and sparsity assumptions. We show that our metrics can predict the level of sparsity of synthetic sparse linear activations, and that they can distinguish sparse linear data from several other distributions. We use our metrics to measure the level of sparsity in several language models, and find evidence that language model activations can be accurately modelled by sparse linear combinations of features, significantly more so than activations from control datasets. We also show that model activations appear to be sparsest in the first and final layers, and least sparse in the middle layers.
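To make the setup concrete, below is a minimal sketch of the kind of experiment the abstract describes. This is not the paper's code, and the fraction-of-variance-unexplained score is only a simple proxy, not the paper's actual metrics; all names and parameter values are illustrative assumptions. The sketch generates synthetic activations that are sparse linear combinations of random feature directions, recovers a dictionary with scikit-learn's DictionaryLearning, and scores the sparse reconstruction.

```python
# Sketch: synthetic sparse linear activations + sparse coding + a proxy
# reconstruction metric. All sizes and names are illustrative assumptions.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
d_model, n_feats, k_active, n_samples = 64, 128, 4, 1000

# Ground-truth feature directions (unit-norm rows).
feats = rng.normal(size=(n_feats, d_model))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# Each synthetic activation is a sparse positive combination of k_active features.
acts = np.zeros((n_samples, d_model))
for i in range(n_samples):
    idx = rng.choice(n_feats, size=k_active, replace=False)
    acts[i] = rng.exponential(size=k_active) @ feats[idx]

# Sparse coding: learn a dictionary whose codes have at most k_active
# nonzero coefficients per sample (orthogonal matching pursuit).
learner = DictionaryLearning(
    n_components=n_feats,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=k_active,
    random_state=0,
)
codes = learner.fit_transform(acts)
recon = codes @ learner.components_

# Proxy metric: fraction of variance unexplained by the sparse reconstruction.
fvu = ((acts - recon) ** 2).sum() / ((acts - acts.mean(0)) ** 2).sum()
print(f"FVU with {k_active} active features per sample: {fvu:.4f}")
```

Sweeping transform_n_nonzero_coefs and tracking how reconstruction quality degrades is one simple way to probe the sparsity level of a dataset, in the spirit of the metrics the abstract describes; the same procedure could be run on language model activations and on control datasets for comparison.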
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6394