Mechanistic Interpretability of Loop Control-Flow Generation

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Venue: Mech Interp Workshop ICML 2026 Poster. License: CC BY 4.0
Keywords: Applications of interpretability, Methods (probing, steering, causal interventions), Concept Discovery (e.g., SAEs, dictionary learning)
TL;DR: We identify which attention heads in Gemma-2 2B drive the choice between break and continue in Python loops. The method partly works for break and fails for continue; one head nominated for continue actually shifts the break logit, suggesting that these heads track loop context rather than keyword choice.
Abstract: Large language models are increasingly used for code generation, but the mechanisms by which they represent and produce program control flow remain poorly understood. We investigate how Gemma-2 2B predicts Python loop control-flow keywords, focusing on the syntactically related tokens break and continue. We combine Direct Logit Attribution (DLA) with activation patching to nominate and causally test attention heads involved in these predictions. For break, we find a localized but distributed circuit: jointly patching the top five DLA-ranked heads recovers 25.4% (95% CI [15.3, 34.1]) of the clean–corrupted logit-difference gap across 100 randomly paired prompts. For continue, the analogous intervention produces only a small movement in the metric, driven by changes in the competing break logit rather than by suppression of continue itself, and the gap recovery is statistically indistinguishable from zero. The DLA-ranked heads are thus causally involved but do not function as direct keyword-selecting components. Two syntactically similar control-flow predictions therefore rely on qualitatively different internal mechanisms, illustrating both the utility and the limitations of DLA-guided activation patching for circuit discovery in code-trained models.
Submission Number: 200
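
The gap-recovery figures quoted in the abstract can be read as a normalized patching effect. The following is a minimal sketch of one common way to compute such a metric, not necessarily the authors' exact procedure. It assumes that the logit difference is logit(target keyword) minus logit(competing keyword), that recovery is averaged per prompt pair, and that the confidence interval is a percentile bootstrap; the function names `gap_recovered` and `bootstrap_ci` are hypothetical.

```python
import random
from statistics import mean


def gap_recovered(clean_ld, corrupted_ld, patched_ld):
    """Fraction of the clean-vs-corrupted logit-difference gap restored by patching.

    Each argument is a logit difference (assumed here to be
    logit(target) - logit(competitor)) measured on the clean run,
    the corrupted run, and the corrupted run with clean head
    activations patched in. 0.0 means no recovery, 1.0 full recovery.
    """
    gap = clean_ld - corrupted_ld
    if gap == 0:
        raise ValueError("clean and corrupted logit differences are equal")
    return (patched_ld - corrupted_ld) / gap


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Assumption: this resampling scheme is illustrative; the abstract
    does not state how its 95% CI was obtained.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy example with made-up numbers (not data from the paper):
# patching restores one quarter of a gap of 4.0 logits.
example = gap_recovered(clean_ld=4.0, corrupted_ld=0.0, patched_ld=1.0)  # 0.25
```

Under this definition, a recovery interval that contains zero, as reported for continue, means the patched heads do not reliably move the metric toward the clean behavior, even if individual logits shift.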