MechInterp for Recurrent Computation: Time-Resolved Circuit Discovery in RNNs

Aishwarya Balwani

Published: 11 Jun 2026, Last Modified: 24 Jun 2026; Mech Interp Workshop ICML 2026 Virtual poster; Readers: Everyone; License: CC BY 4.0

Keywords: Applications of interpretability, Interpretability for Knowledge Discovery, Circuit Analysis, Attribution Graphs

Other Keywords: RNNs, Computational Neuroscience

TL;DR: Causal interventions + local Jacobian projections reveal when RNN neurons are task-critical and how their effective circuits dynamically reconfigure over time.

Abstract: Despite being one of computational neuroscience's most prominent modelling tools, recurrent neural networks (RNNs) have shown limited utility for revealing explicit structure-function relationships in neuronal circuits. This shortcoming arises because recurrent computations are distributed across neurons, timesteps, and internal states, so static summaries of weights or average activity often fail to reveal the transient causal interactions through which behavior may be implemented. In this work, we present a circuit discovery framework that adapts causal intervention techniques from the mechanistic interpretability of large language models to the recurrent, multi-step computation of RNNs. By combining windowed ablations with Jacobian-based linearization of the hidden-state trajectories, we estimate effective connectivity across the RNN as it evolves, thereby revealing how task-relevant computations are implemented through dynamically coordinated subcircuits. Across synthetic tasks with known mechanisms, our pipeline recovers operative circuits with high precision while demonstrating robustness advantages over correlation-based selection methods. Applying our methods to anatomically constrained RNNs trained on the Allen Institute's Visual Behavior dataset, we recover VIP involvement in unexpected-stimulus processing and reveal temporally specific causal contributions invisible to static analyses. Our experiments further suggest that intrinsic VIP timing drives prediction error formation in the network, bridging the timing-code and predictive-processing interpretations of VIP function.

Submission Number: 584
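
Illustrative sketch (not the author's pipeline): the abstract combines windowed ablations with Jacobian-based linearization of the hidden-state trajectory. The NumPy example below shows what those two ingredients could look like under a strong simplifying assumption, namely a vanilla tanh RNN with update h_next = tanh(W h + U x + b). All function names, shapes, and the random weights are hypothetical and chosen only for illustration; the anatomically constrained models described in the abstract are likely more involved.

```python
import numpy as np

def step(W, U, b, h, x):
    """One vanilla-RNN update (assumed form): h_next = tanh(W h + U x + b)."""
    return np.tanh(W @ h + U @ x + b)

def run(W, U, b, xs, h0, ablate=None, window=None):
    """Roll out the RNN and return the hidden states, shape (T, n).

    If `ablate` is given, those units are clamped to zero for every
    timestep t with window[0] <= t < window[1] (a windowed ablation).
    """
    h = h0.copy()
    hs = []
    for t, x in enumerate(xs):
        h = step(W, U, b, h, x)
        if ablate is not None and window[0] <= t < window[1]:
            h[ablate] = 0.0
        hs.append(h)
    return np.stack(hs)

def local_jacobian(W, h_next):
    """d h_{t+1} / d h_t for the tanh update: diag(1 - h_{t+1}^2) @ W."""
    return (1.0 - h_next ** 2)[:, None] * W

def windowed_ablation_effect(W, U, b, xs, h0, readout, unit, window):
    """Change in the final linear readout when `unit` is silenced in `window`."""
    clean = run(W, U, b, xs, h0)
    ablated = run(W, U, b, xs, h0, ablate=[unit], window=window)
    return float(readout @ clean[-1] - readout @ ablated[-1])

# Hypothetical toy network: 8 hidden units, 3 inputs, 20 timesteps.
rng = np.random.default_rng(0)
n, d, T = 8, 3, 20
W = rng.normal(scale=0.5, size=(n, n))
U = rng.normal(scale=0.5, size=(n, d))
b = np.zeros(n)
xs = rng.normal(size=(T, d))
h0 = np.zeros(n)
readout = rng.normal(size=n)

hs = run(W, U, b, xs, h0)
# Time-resolved effective connectivity: one (n, n) Jacobian per timestep,
# evaluated along the unablated trajectory.
J = np.stack([local_jacobian(W, h) for h in hs])
# Causal effect of silencing unit 2 in an early versus a late window.
early = windowed_ablation_effect(W, U, b, xs, h0, readout, 2, (0, 5))
late = windowed_ablation_effect(W, U, b, xs, h0, readout, 2, (15, 20))
```

Comparing `early` and `late` for the same unit gives a rough sense of when that unit matters for the output, while `J[t]` describes which units influence which others at timestep `t`. How the two signals are combined into a circuit is not specified in the abstract, so this sketch leaves them separate.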
