Keywords: Linear representations, theory, low-rank
Abstract: Why do some internal concepts in a neural network become readily accessible to simple linear probes, while others remain difficult to extract? In this paper, we show that, under a fixed interaction dictionary and a low-dimensional bottleneck, the next-token cross-entropy objective induces pre-linearization: the model preferentially preserves interaction directions that are both statistically common in the data and highly useful for prediction. In the simplest orthogonal setting, this yields an explicit commonness-times-usefulness law. In the general case, however, the relevant object is not an individual named feature but a basis-invariant whitened subspace of interaction space. We test these predictions in controlled synthetic settings, trained transformers, and pretrained language models, and find that the directions singled out by the theory are the ones that become most linearly readable in all three settings; in the controlled settings, they are also the most effective and selective directions for steering model behavior.
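For readers unfamiliar with the probing setup the abstract refers to, the sketch below illustrates what "linearly readable" means operationally: fit a simple linear probe on hidden activations and check how well it recovers a concept label. This is a minimal, hedged illustration with synthetic placeholder data and names (d_model, concept_dir, X, y); it is not the paper's code or experimental setup.

```python
# Minimal sketch of linear-probe readability (not the paper's code).
# All data and names here are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical setup: hidden activations X (n_samples x d_model) and binary
# labels y indicating whether a concept is active in each input.
d_model, n = 64, 2000
concept_dir = rng.normal(size=d_model)
concept_dir /= np.linalg.norm(concept_dir)          # unit concept direction
y = rng.integers(0, 2, size=n)

# Activations carry the concept along one direction plus isotropic noise;
# a concept is "linearly readable" when a linear probe separates y from X.
X = np.outer(y - 0.5, concept_dir) + 0.3 * rng.normal(size=(n, d_model))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"linear probe accuracy: {probe.score(X_te, y_te):.3f}")
```

High probe accuracy on held-out data is the usual operational criterion for a concept being "readily accessible to simple linear probes"; the paper's claim concerns which directions end up in this regime after training.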
Submission Number: 10