A Path Already Walked: On Inheriting Network-Neuroscience Tools for Mechanistic Interpretability

Phongsakon Mark Konrad; Toygar Tanyel; Serkan Ayvaz

Published: 11 Jun 2026, Last Modified: 11 Jun 2026; Mech Interp Workshop ICML 2026 Virtual poster; CC BY 4.0

Keywords: Methods (probing, steering, causal interventions), Circuit Analysis, Attribution Graphs, Applications of interpretability

Other Keywords: network neuroscience, graph theory, effective connectivity, null models, attribution graphs

TL;DR: This paper proposes a disciplined framework for importing network-neuroscience graph tools into mechanistic interpretability by requiring explicit transformer graph contracts, null models, and falsifiable translations.

Abstract: Mechanistic interpretability is moving from neurons and heads toward circuits, dictionary features, and attribution graphs. That transition is productive, but it also raises a familiar issue. Many important phenomena are relational rather than component-local. Network neuroscience has spent two decades building graph vocabulary, null models, and failure modes for related problems. We argue for a disciplined import rather than a loose brain analogy. We specify the transformer graph contract required before the import is meaningful, give a compact mapping from network-neuroscience primitives to transformer analyses, work through a local effective-connectivity proxy for gated MLPs, and state eight testable translations with failure criteria. We do not report transformer experiments, and we do not claim neuroscience results transfer automatically.

Submission Number: 668
