Keywords: Circuit Analysis, Attribution Graphs, Feature Geometry, Applications of interpretability
Other Keywords: Neuroscience
TL;DR: Representational alignment metrics measure the behavior of neural populations but are blind to the function that produces it. We propose encoding manifolds and Gromov–Wasserstein (GW) distance as complementary tools for analyzing functional alignment.
Abstract: Representational similarity analysis (RSA) and centered kernel alignment (CKA) are standard metrics for comparing neural representations across brain regions, organisms, and deep learning models. We demonstrate a fundamental weakness: these decoding-based metrics are insensitive to encoding manifold topology, that is, the internal functional organization of a neural population. In a controlled MNIST experiment, RSA, CKA, and Procrustes $R^2$ remain statistically unchanged when encoding topology is causally manipulated via an auxiliary clustering loss, while the two model populations differ significantly in attribution patterns, weight-graph assortativity, and out-of-distribution robustness. Across biological systems and machine learning models, similar decoding behavior can arise from small, non-representative subpopulations, and alignment metrics are insensitive to encoding manifold topology even when it is fundamentally altered. These findings bear directly on mechanistic interpretability: standard alignment metrics cannot distinguish whether two networks share the same computational circuits or merely produce indistinguishable aggregate outputs. We propose encoding manifolds and Gromov–Wasserstein distance as complementary diagnostics for any decoding-based similarity claim, and provide a Neural Manifold Explorer tool.
Submission Number: 330