Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning

Published: 03 Mar 2026, Last Modified: 26 Apr 2026 · ICLR 2026 Workshop FM4Science Poster · CC BY 4.0
Keywords: mechanistic interpretability, causal intervention, probing limitations, trustworthy representation analysis, distributed encoding robustness, fault-tolerant task representations, activation steering reliability, interpretability-causality gap, verifiable AI mechanisms, robust few-shot learning, principled model analysis, causal tracing, representation faithfulness, intervention-based interpretability, safe activation editing, reliable task transfer, LLM transparency
TL;DR: Linear probing is causally blind: 100% accuracy yields 0% transfer. Multi-position intervention achieves 96%, proving ICL task identity emerges from distributed output templates—fundamentally overturning the localized representation paradigm in LLMs.
Abstract: Understanding how large language models encode task identity from few-shot demonstrations is a central open problem in mechanistic interpretability. Prior work uses linear probing to localize task representations, reporting high classification accuracy at specific layers. We reveal a striking dissociation: probing accuracy completely fails to predict causal importance. Single-position activation intervention achieves 0% task transfer across all 28 layers of Llama-3.2-3B—despite 100% probing accuracy at those same positions. This null result is itself a key finding, demonstrating that task encoding is fundamentally distributed. Multi-position intervention—replacing activations at all demonstration output tokens simultaneously—achieves up to 96% transfer (N=50, 95% CI: [87%, 99%]) at layer 8, pinpointing for the first time the causal locus of ICL task identity. We establish the generality of these findings across four models spanning three architecture families (LLaMA, Qwen, Gemma), discovering a universal intervention window at ∼30% network depth. Causal tracing uncovers an asymmetric architecture: the query position is strictly necessary (53–100% disruption) while no individual demonstration position is necessary (0% disruption)—resolving a key ambiguity in prior accounts. Crucially, transfer depends on internal representation compatibility, not surface similarity (r=−0.05 vs r=0.31), ruling out trivial explanations. These results establish the distributed template hypothesis: ICL task identity is encoded as output format templates distributed across demonstration tokens, fundamentally reshaping our understanding of how in-context learning operates.
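The multi-position intervention described above (replacing activations at all demonstration output tokens simultaneously, at a single layer) can be illustrated with a minimal PyTorch sketch. This is not the authors' code: the toy model, layer index, and token positions are hypothetical stand-ins for Llama-3.2-3B's layer 8 and the actual demonstration output positions, chosen only to show the hook mechanics.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block; hypothetical, not Llama-3.2-3B.
class ToyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        return x + torch.relu(self.proj(x))

class ToyModel(nn.Module):
    def __init__(self, d=16, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(ToyBlock(d) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def multi_position_patch(model, layer_idx, positions, source_acts):
    """Forward hook that overwrites activations at ALL given token
    positions at once (multi-position intervention), as opposed to
    patching a single position."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, positions, :] = source_acts[:, positions, :]
        return patched  # returning a tensor replaces the block's output
    return model.layers[layer_idx].register_forward_hook(hook)

torch.manual_seed(0)
model = ToyModel()
target = torch.randn(1, 10, 16)  # prompt carrying task B demonstrations
source = torch.randn(1, 10, 16)  # prompt carrying task A demonstrations

# Cache source-task activations at the intervention layer
# (~30% depth here: layer 1 of 4, mirroring the paper's depth finding).
acts = {}
h = model.layers[1].register_forward_hook(
    lambda m, i, o: acts.__setitem__("src", o.detach()))
model(source)
h.remove()

demo_output_positions = [2, 5, 8]  # hypothetical demonstration output tokens
h = multi_position_patch(model, 1, demo_output_positions, acts["src"])
patched_out = model(target)
h.remove()
```

In the real experiment the patched run would be scored on whether the model now performs the source task; here the sketch only demonstrates that the intervention alters the forward pass at the chosen positions while leaving the hook machinery reusable across layers.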
Submission Number: 22