What Arranges Features in Activation Space? Non-Classical Predictive Geometry in Next-Token Predictors
Keywords: Feature Geometry
TL;DR: Non-Classical Predictive Geometry Explains the Arrangement of Features in Activation Space.
Abstract: Mechanistic interpretability often studies the local features and circuits that implement model computations. What principles govern the arrangement of these
features and circuits into geometric structures in activation space? To make this
tractable, we study how the computational class of the training-data generator constrains the geometry of predictive states. We show that while the data distribution
determines which features are required for prediction, a predictor realizes those
features as beliefs about its current latent state, and the generator class determines
the geometry of those beliefs. Using this theoretical insight, we design synthetic
datasets whose minimal predictive representations fall into different model classes,
and test which geometry neural networks learn. In particular, we train transformers, LSTMs, GRUs, and vanilla RNNs on datasets whose predictive geometries
are known analytically: a classical HMM process, a quantum-realizable process
with no finite-state HMM realization, and a generalized-probabilistic process with
no finite-dimensional quantum realization. Across architectures, a single affine
map from activations decodes the corresponding predictive representation in each
case: HMM beliefs in a latent simplex, Bloch-vector quantum states, or a finite-dimensional generalized predictive vector. These representations emerge during
training and match the compact non-classical geometry far better than they match
finite-order classical Markov baselines. These results suggest that understanding predictive
representations requires asking not only which features a network represents, but
what geometry organizes those features.
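
As a rough illustration of the affine-decoding claim in the classical HMM case, the sketch below computes Bayesian belief states for a toy HMM and fits a least-squares affine map from activations to those beliefs. Everything specific here is an assumption made for illustration: the random 3-state, 2-symbol HMM, the uniform initial belief, the names `belief_update` and `fit_affine_probe`, and above all the synthetic `acts` array, which stands in for real network activations and is affine in the beliefs by construction. It shows the shape of the probing procedure, not the paper's datasets, models, or results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state, 2-symbol HMM (not one of the paper's processes).
# T[x, i, j] = P(emit symbol x and move to state j | current state i).
T = rng.random((2, 3, 3))
T /= T.sum(axis=(0, 2), keepdims=True)  # outcomes of each source state sum to 1


def belief_update(belief, T_x):
    """One Bayesian filtering step after observing the symbol whose matrix is T_x."""
    unnorm = belief @ T_x
    return unnorm / unnorm.sum()


def fit_affine_probe(acts, targets):
    """Least-squares affine map: targets ~ acts @ W + b."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return coef[:-1], coef[-1]


# Sample a sequence and record the belief after each observed symbol.
belief = np.full(3, 1.0 / 3.0)  # assumed uniform prior over latent states
beliefs = []
for _ in range(500):
    # Predictive distribution over the next symbol under the current belief.
    p_x = np.array([(belief @ T[x]).sum() for x in range(2)])
    x = rng.choice(2, p=p_x / p_x.sum())
    belief = belief_update(belief, T[x])
    beliefs.append(belief)
beliefs = np.array(beliefs)  # shape (500, 3); each row lies in the simplex

# Stand-in "activations": a noisy affine image of the beliefs in 16 dimensions.
# With a trained network, replace this with its recorded hidden states.
A = rng.normal(size=(3, 16))
acts = beliefs @ A + rng.normal(size=16) + 0.01 * rng.normal(size=(500, 16))

W, b = fit_affine_probe(acts, beliefs)
pred = acts @ W + b
mse = np.mean((pred - beliefs) ** 2)
r2 = 1.0 - mse / beliefs.var(axis=0).mean()
print(f"probe MSE = {mse:.2e}, R^2 = {r2:.4f}")
```

For the quantum and generalized-probabilistic cases, the same probe would presumably be fit against Bloch vectors or generalized predictive vectors in place of `beliefs`; the regression step itself is unchanged.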
Submission Number: 545