The Connectome of a Large Language Model

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Mech Interp Workshop ICML 2026 Virtual, poster, CC BY 4.0
Keywords: Circuit Analysis, Attribution Graphs, Feature Geometry
Other Keywords: Network Analysis, Renormalization Group
TL;DR: We apply a large-scale network-analysis toolbox to transcoder feature networks and find non-trivial network properties that correspond to those of cortical networks.
Abstract: A central question about information processing systems is whether fundamentally different optimization processes lead to similar internal organization. Cortical networks exhibit small-world topology, modular communities, scale-free hubs, and structural-functional dissociation, and sparse transcoders now provide the interpretable units needed to test these same properties inside a large language model. We use skip transcoders from Gemma Scope 2 to extract 16,384 sparse interpretable features per MLP layer of Gemma 3 (270M and 1B), construct per-layer co-activation networks from 1M tokens of FineWeb-Edu, and analyze them with a standard network-neuroscience toolbox augmented by the Laplacian renormalization group and the spectral heat capacity. Both models reproduce all four classical cortical signatures at quantitatively similar values. Multi-scale analysis further reveals a dual-regime fractal scaling with a local exponent $d_B \approx 1.35$ below the Leiden community diameter, a single-peaked heat capacity that sharpens with depth, and a non-monotonic cross-layer coupling trajectory passing through directed, broadcast, and focused regimes. Hub features and coarse-grained communities carry interpretable semantics organized along a tokenization-syntax-semantics-morphology hierarchy. These findings suggest that next-token prediction on web-scale text pushes transformer representations toward a network organization that, in quantitative terms, parallels the cortex.
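As a rough illustration of the co-activation network construction described in the abstract, the following minimal sketch links features that fire on the same tokens and computes the average clustering coefficient, one ingredient of a small-world analysis. The abstract does not specify how edges are defined; the Jaccard threshold, the function names, and the toy data here are assumptions for illustration only, not the authors' method.

```python
from collections import defaultdict
from itertools import combinations


def coactivation_graph(active_sets, min_jaccard=0.5):
    """Link features whose sets of active tokens overlap strongly.

    active_sets: list of sets, one per token, holding the ids of the
    features active on that token. The Jaccard threshold is an assumed
    edge rule, not one taken from the paper.
    """
    count = defaultdict(int)       # tokens on which each feature fires
    pair_count = defaultdict(int)  # tokens on which both features fire
    for feats in active_sets:
        for f in feats:
            count[f] += 1
        for a, b in combinations(sorted(feats), 2):
            pair_count[(a, b)] += 1

    adj = {f: set() for f in count}
    for (a, b), both in pair_count.items():
        jaccard = both / (count[a] + count[b] - both)
        if jaccard >= min_jaccard:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Toy data: five hypothetical features observed over six tokens.
tokens = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {2, 3}, {3, 4}, {3, 4}]
adj = coactivation_graph(tokens, min_jaccard=0.5)
print(adj)
print(average_clustering(adj))
```

At the scale reported in the abstract (16,384 features per layer and 1M tokens), a sparse-matrix formulation would likely be needed in place of this pairwise loop.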
Submission Number: 541