Keywords: representational alignment, superposition, theory, neural geometry, sparse autoencoder, universality, linear regression, disentanglement
TL;DR: Neural networks often appear misaligned not because they learn different things, but because their neurons represent different mixtures of the exact same underlying features.
Abstract: Neural networks trained on the same tasks achieve similar performance but often show surprisingly low representational alignment. We argue this is a measurement artifact, a *mirage of misalignment*, caused by superposition, in which individual neurons represent mixtures of features. Consequently, two networks representing identical feature sets can appear dissimilar if their neurons mix those features differently. To formalize this intuition, we derive an analytic theory that predicts this apparent misalignment for common linear metrics such as representational similarity analysis and linear regression. We validate the theory in settings of increasing complexity: it exactly predicts the apparent misalignment between random projections of identical features, and on real data, sparse autoencoders recover underlying disentangled features whose latent codes are often far more aligned than the raw neural representations. These results show that linear alignment metrics applied to raw neural activations can be systematically misleading under superposition: neural networks are more aligned than previously believed, and the common practice of comparing raw activations with linear probing may substantially underestimate model similarity.
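The following is a minimal sketch, not the paper's code, of the core intuition: two toy "networks" carry the exact same sparse features but superpose them into fewer neurons through different random mixing matrices, and standard linear alignment metrics computed on the raw activations then look low. All names, parameter choices, and the simple RSA and regression implementations below are illustrative assumptions.

```python
# Sketch (assumed setup): identical sparse features, different random mixings.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_neurons = 2000, 64, 16   # more features than neurons

# Ground-truth sparse feature activations shared by both "networks".
Z = rng.exponential(size=(n_samples, n_features))
Z *= rng.random((n_samples, n_features)) < 0.05   # ~5% of features active

# Each network superposes the same features with its own random projection.
W1 = rng.normal(size=(n_features, n_neurons)) / np.sqrt(n_neurons)
W2 = rng.normal(size=(n_features, n_neurons)) / np.sqrt(n_neurons)
A1, A2 = Z @ W1, Z @ W2                           # raw "neural" activations

def rsa(X, Y):
    """Correlate the off-diagonal entries of the two stimulus-by-stimulus
    similarity (Gram) matrices, a simple RSA-style score."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Gx, Gy = Xc @ Xc.T, Yc @ Yc.T
    iu = np.triu_indices_from(Gx, k=1)
    return np.corrcoef(Gx[iu], Gy[iu])[0, 1]

def linreg_r2(X, Y):
    """Fraction of Y's total variance explained by least-squares regression on X."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    resid = Yc - Xc @ B
    return 1.0 - resid.var(0).sum() / Yc.var(0).sum()

# Raw activations look misaligned...
print("raw RSA:      ", round(rsa(A1, A2), 3))
print("raw linreg R2:", round(linreg_r2(A1, A2), 3))
# ...even though the shared underlying feature code is identical by construction.
print("feature RSA:  ", round(rsa(Z, Z), 3))      # == 1.0
```

In this toy setting, the raw-activation scores fall well below 1 purely because of the random mixing, which is the apparent misalignment the analytic theory is said to predict; recovering the disentangled features (in the paper, via sparse autoencoders) removes it.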
Primary Area: applications to neuroscience & cognitive science
Submission Number: 21678