Keywords: Foundational work
Other Keywords: Superposition, Feature geometry, Linear Representation Hypothesis
TL;DR: Small latent spaces and weight decay lead to linear PCA-like superposition under which feature geometry reflects the correlations in the data, explaining previously observed feature geometries.
Abstract: Recent advances in mechanistic interpretability have shown that many features of deep learning models can be captured by dictionary learning approaches such as sparse autoencoders. However, our geometric intuition for how features arrange themselves in a representation space is still limited. "Toy-model" analyses have shown that, in an idealized setting, features can be arranged in local structures, such as small regular polytopes, through a phenomenon known as _superposition_. Yet these local structures have not been observed in real language models. In contrast, these models display rich structures, like ordered circles for the months of the year or semantic clusters, which are not predicted by current theories. In this work, we introduce Bag-of-Words Superposition (BOWS), a framework in which autoencoders with a ReLU in the decoder are trained to compress sparse, binary bag-of-words vectors drawn from Internet-scale text. This simple setup reveals the existence of a _linear regime_ of superposition, which appears in ReLU autoencoders with small latent sizes or trained with weight decay. We show that this linear, PCA-like superposition naturally gives rise to the same semantically rich structures observed in real language models. Code is available at https://anonymous.4open.science/r/correlations-feature-geometry-AF54.
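A minimal sketch of the setup described in the abstract: an autoencoder with a linear encoder and a ReLU in the decoder, trained to reconstruct sparse, binary bag-of-words vectors, with weight decay as one of the knobs said to induce the linear regime. All hyperparameters, the loss, and the synthetic data generation are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a BOWS-style autoencoder (assumed details: sizes, sparsity, loss, optimizer).
import torch
import torch.nn as nn

class BOWSAutoencoder(nn.Module):
    """Linear encoder into a small latent space; decoder output passed through a ReLU."""
    def __init__(self, vocab_size: int, latent_size: int):
        super().__init__()
        self.encoder = nn.Linear(vocab_size, latent_size, bias=False)
        self.decoder = nn.Linear(latent_size, vocab_size, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)                   # compress into the small latent space
        return torch.relu(self.decoder(z))    # ReLU in the decoder, matching non-negative BoW targets

# Hypothetical sparse, binary bag-of-words batch (the paper draws these from Internet-scale text).
vocab_size, latent_size, batch_size = 1000, 16, 256
x = (torch.rand(batch_size, vocab_size) < 0.01).float()  # ~1% of words active per "document"

model = BOWSAutoencoder(vocab_size, latent_size)
# Weight decay is one of the two conditions the abstract associates with the linear regime.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)

for step in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)  # reconstruct the binary bag-of-words vector
    loss.backward()
    opt.step()
```

Under the assumed reconstruction objective, shrinking `latent_size` or increasing `weight_decay` is the kind of intervention the abstract claims pushes the model toward PCA-like, linear superposition.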
Submission Number: 272