Reverse Engineering a Stateful Reasoning Circuit

Published: 30 Sept 2025, Last Modified: 10 Nov 2025, Mech Interp Workshop (NeurIPS 2025) Poster, CC BY 4.0
Open Source Links: https://github.com/komikat/prep-gated-circuits
Keywords: Circuit analysis, Foundational work
Other Keywords: linguistic probing
TL;DR: We reverse-engineer a “Query-Gated Courier” circuit in Gemma-2-2B for role-gated retrieval.
Abstract: We study Gemma-2-2B on a controlled role-gated retrieval task where a prepositional gate ($\texttt{to}$ or $\texttt{from}$) selects which of two entities is correct. On 60 single-token name pairs the model attains 100% accuracy with a mean flip magnitude of $\approx 3.5$ (the sum of per-condition correctness margins). Using causal tracing, we identify a Query-Gated Courier circuit with three stages: (1) a gate token ($\texttt{from}$/$\texttt{to}$) writes a role feature at the answer; (2) this feature perturbs late-layer courier queries, shifting their $q \cdot k$ preference; (3) couriers attend to the correct name and inject it via OV, raising its logit. Gate-residual swaps flip predictions, and a compact nine-head keep set reproduces the behavior with high fidelity. The circuit gives a potential algorithm for role tracking and aligns with the Paninian Kāraka analysis, mapping $\texttt{to}$ to sampradāna and $\texttt{from}$ to apādāna.
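As a reading aid for the "flip magnitude" figure in the abstract, the sketch below shows one plausible way to compute a sum of per-condition correctness margins. It assumes that a margin is the logit of the correct name minus the logit of the other name, taken at the answer position. The function names, argument names, and toy logits are hypothetical and are not taken from the paper or its repository.

```python
from typing import Mapping


def margin(logits: Mapping[str, float], correct: str, distractor: str) -> float:
    """Correctness margin: logit of the correct name minus the other name's logit."""
    return logits[correct] - logits[distractor]


def flip_magnitude(
    logits_to: Mapping[str, float],
    logits_from: Mapping[str, float],
    correct_under_to: str,
    correct_under_from: str,
) -> float:
    """Sum of the margins under the two gate conditions (assumed definition)."""
    m_to = margin(logits_to, correct_under_to, correct_under_from)
    m_from = margin(logits_from, correct_under_from, correct_under_to)
    return m_to + m_from


# Toy answer-position logits for one name pair (illustrative values only).
toy_logits_to = {"Alice": 5.0, "Bob": 3.0}
toy_logits_from = {"Alice": 2.5, "Bob": 4.0}

toy_flip = flip_magnitude(toy_logits_to, toy_logits_from, "Alice", "Bob")
print(toy_flip)
```

Under this reading, a pair counts as answered correctly in both conditions only when each individual margin is positive, and the reported mean would be the average of this quantity over all name pairs.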
Submission Number: 307