Keywords: Methods (probing, steering, causal interventions), Circuit Analysis, Attribution Graphs, Applications of interpretability
Other Keywords: network neuroscience, graph theory, effective connectivity, null models, attribution graphs
TL;DR: This paper proposes a disciplined framework for importing network-neuroscience graph tools into mechanistic interpretability by requiring explicit transformer graph contracts, null models, and falsifiable translations.
Abstract: Mechanistic interpretability is moving from neurons and heads toward circuits, dictionary features, and attribution graphs. That transition is productive, but it also raises a familiar issue. Many important phenomena are relational rather than component-local. Network neuroscience has spent two decades building graph vocabulary, null models, and failure modes for related problems. We argue for a disciplined import rather than a loose brain analogy. We specify the transformer graph contract required before the import is meaningful, give a compact mapping from network-neuroscience primitives to transformer analyses, work through a local effective-connectivity proxy for gated MLPs, and state eight testable translations with failure criteria. We do not report transformer experiments, and we do not claim neuroscience results transfer automatically.
Submission Number: 668