Keywords: Mechanistic Interpretability, RNNs, Activation Patching, Circuit Analysis
TL;DR: Mechanistic interpretability for time-resolved, sub-task-specific neural circuit discovery in RNNs
Abstract: Recurrent neural networks (RNNs) have been widely adopted as models of cortical computation, yet their utility for understanding neural mechanisms and explicit structure-function relationships has been limited by their opacity. Recent advances in mechanistic interpretability offer new hope for opening these black boxes, moving beyond correlation-based analyses to causal understanding. Building on these developments, we present a time-resolved circuit discovery method that reveals how RNNs implement computations through dynamically coordinated subcircuits. Specifically, we combine windowed causal interventions with time-resolved linearization to identify task-critical neurons and visualize the dynamic reconfiguration of effective connectivity, exposing the temporal orchestration of information flow. We validate our pipeline on two synthetic tasks with known ground truth: i) a ring attractor network, in which we successfully recover the neuronal circuits underlying static states as well as traveling and jumping bump dynamics, and ii) a hidden Markov model inference task, for which the discovered circuits for hidden state inference match full-network decoding performance while remaining robust to noise. We then demonstrate our approach on RNNs trained with Dale's law constraints to perform a context-dependent flip-flop task, identifying distinct circuits for memory maintenance, state switching, and context-gated control. We find that excitatory and inhibitory neurons show consistent functional specialization: memory circuits are dominated by recurrent excitation, while switching circuits recruit inhibitory neurons at transition points. Critically, our time-resolved analysis reveals that during context switches, the memory circuit remains stable while a separate gating circuit dynamically reconfigures, a temporal dissociation invisible to static analyses. These findings demonstrate that mechanistic interpretability can bridge the gap between artificial and biological neural networks, transforming RNNs from black-box function approximators into white-box models of neural computation. We hope that our work encourages further development of such tools, which promise to advance our understanding of both artificial and biological intelligence.
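The core intervention described in the abstract, restoring a subset of unit activations from a reference trajectory within a restricted time window and measuring the effect on the readout, can be illustrated with a minimal sketch. The toy network, the functions `run_rnn` and the scoring loop below are hypothetical and assume a vanilla tanh RNN with a linear readout; they are not the authors' pipeline or tasks.

```python
# Minimal sketch of windowed activation patching in a vanilla RNN (NumPy).
# All names (run_rnn, the patch tuple, the toy weights) are hypothetical
# illustrations of the general technique, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
N, T = 50, 100                                 # number of units, time steps
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N))    # recurrent weights
W_in = rng.normal(0, 1.0, (N, 1))              # input weights
w_out = rng.normal(0, 1.0, N)                  # linear readout

def run_rnn(x, patch=None):
    """Roll out the RNN; optionally overwrite selected units in a time window.

    patch = (t_start, t_end, unit_idx, source_h): inside [t_start, t_end),
    the activations of `unit_idx` are replaced with those from `source_h`.
    """
    h = np.zeros(N)
    hs = np.zeros((T, N))
    for t in range(T):
        h = np.tanh(W @ h + W_in @ x[t])
        if patch is not None:
            t0, t1, idx, src = patch
            if t0 <= t < t1:
                h = h.copy()
                h[idx] = src[t, idx]           # windowed causal intervention
        hs[t] = h
    return hs

# Two input conditions playing the role of "clean" and "corrupted" trials.
x_clean = rng.normal(0, 1, (T, 1))
x_corrupt = rng.normal(0, 1, (T, 1))

h_clean = run_rnn(x_clean)
y_clean = h_clean[-1] @ w_out
y_corrupt = run_rnn(x_corrupt)[-1] @ w_out

# Causal effect of each unit within a chosen window [t0, t1): how much does
# restoring that unit's clean activity move the corrupted readout back
# toward the clean readout?
t0, t1 = 40, 60
effects = np.zeros(N)
for i in range(N):
    y_patched = run_rnn(x_corrupt, patch=(t0, t1, [i], h_clean))[-1] @ w_out
    effects[i] = (y_patched - y_corrupt) / (y_clean - y_corrupt + 1e-12)

critical = np.argsort(-np.abs(effects))[:5]
print("Most task-critical units in this window:", critical)
```

Sweeping the window endpoints (t0, t1) across the trial and repeating the scoring loop yields a time-resolved map of which units are causally necessary at each stage of the computation, which is the kind of sub-task-specific circuit attribution the abstract describes.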
Submission Number: 139