Keywords: linear representation hypothesis, atomic features, foundations of interpretability
TL;DR: We formalize and provide evidence for an atomic representation hypothesis, where there is a fundamental set of features that compose language model representations
Abstract: Features serve as a core conceptual building block in the study of language models. We formalize the hypothesis that there exists an atomic set of features and consider whether sparse autoencoders (SAEs) are capable of recovering these features. Starting from this theory, we derive testable hypotheses, including a stability principle: under the hypothesis, SAEs of increasing size recover a growing set of stable features (which we expect to be the atomic features). We demonstrate that this principle holds in practice for SAEs trained on two large embedding models. We also find evidence for three other testable predictions: recovery of features at each level of a hierarchy, recovery of shared features under different training data distributions, and recovery of shared features across SAEs trained on different embedding models (evidence of platonicity). Our results suggest that a theory of atomic features is both expressive and faithful to observed behavior. Practically, they suggest that scaling SAEs can provide more granularity while retaining stable high-level features.
Submission Number: 129