Detection Without Suppression: A Sign Flip in IOI Circuit Formation

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Poster
License: CC BY 4.0
Keywords: Circuit Analysis, Attribution Graphs, Methods (probing, steering, causal interventions), Feature Geometry
Other Keywords: Training Dynamics, Circuit Formation, IOI
TL;DR: During IOI circuit formation, the S2 residual stream flips from promoting the wrong answer to being essential for the correct one, coinciding with the emergence of S-inhibition heads.
Abstract: Transformers trained on language modeling acquire indirect object identification (IOI) non-monotonically: accuracy drops below chance before recovering. We observe this dip across 15 training runs spanning two model families, three scales, and nine seed/data/initialization variants. Linear probes show that, during the dip, name-duplication information is linearly decodable at 99% accuracy at the S2 position on Pythia-160M while IOI accuracy is only 41%; a roughly 89% S2 probe baseline is already present at random initialization. Activation patching reveals a sign flip in the change in logit difference (∆LD): replacing S2 with a non-duplicate control reduces the S-bias during the dip (∆LD = +0.94) but degrades performance after recovery (∆LD = −4.13), while the same intervention at a non-S2 position produces ∆LD = 0.000. Head ablation shows that S-inhibition heads, which eventually contribute 10.8× more to the logit difference than name movers, have not yet acquired large measurable effects during the below-chance phase.
Submission Number: 532
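
The ∆LD values quoted in the abstract can be read with a small worked example. The sketch below is an illustration, not the authors' code: it assumes LD is the mean of logit(IO) minus logit(S) at the final position, and that ∆LD is the patched value minus the clean value, so a positive ∆LD means the intervention moved the model toward the correct indirect object. The function names and the toy logits are hypothetical.

```python
import numpy as np

def logit_diff(logits, io_token, s_token):
    """Mean of logit(IO) - logit(S) over a batch of final-position logits.

    logits:   array of shape (batch, vocab)
    io_token: integer array of shape (batch,), index of the indirect object name
    s_token:  integer array of shape (batch,), index of the subject name
    """
    logits = np.asarray(logits, dtype=float)
    rows = np.arange(logits.shape[0])
    return float(np.mean(logits[rows, io_token] - logits[rows, s_token]))

def delta_ld(clean_logits, patched_logits, io_token, s_token):
    """Assumed convention: patched LD minus clean LD."""
    return (logit_diff(patched_logits, io_token, s_token)
            - logit_diff(clean_logits, io_token, s_token))

# Toy data (made up): a 3-token vocabulary where index 0 is IO and index 1 is S.
io = np.array([0, 0])
s = np.array([1, 1])
clean = np.array([[1.0, 2.0, 0.0], [0.5, 1.5, 0.0]])    # S favored, LD = -1.0
patched = np.array([[1.5, 1.5, 0.0], [1.0, 1.0, 0.0]])  # S-bias removed, LD = 0.0

toy_delta = delta_ld(clean, patched, io, s)  # positive: patching reduced the S-bias
```

Under this convention, a positive value such as the reported +0.94 corresponds to the patch reducing the bias toward S, and a negative value such as −4.13 corresponds to the patch removing something the model needs to answer correctly.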