Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning

Published: 02 Mar 2026 · Last Modified: 18 Mar 2026
LIT Workshop @ ICLR 2026
License: CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: in-context learning, mechanistic interpretability, negative results, linear probing limitations, activation intervention, distributed representations, large language models, causal analysis, probing-causality dissociation, task encoding, few-shot learning, transformer internals, activation patching, representation localization, activation steering, fault-tolerant encoding, template hypothesis, intervention failure
TL;DR: Linear probing is causally blind: 100% probing accuracy yields 0% intervention transfer. Multi-position intervention achieves 96% transfer, showing that ICL task identity is carried by distributed output templates rather than a single localized representation in LLMs.
Abstract: Understanding how large language models encode task identity from few-shot demonstrations is a central open problem in mechanistic interpretability. Prior work uses linear probing to localize task representations, reporting high classification accuracy at specific layers. We reveal a striking dissociation: probing accuracy fails to predict causal importance. Single-position activation intervention achieves 0% task transfer across all 28 layers of Llama-3.2-3B—despite 100% probing accuracy at those same positions. This null result is itself a key finding, demonstrating that task encoding is fundamentally distributed. Multi-position intervention—replacing activations at all demonstration output tokens simultaneously—achieves up to 96% transfer (N=50, 95% CI: [87%, 99%]) at layer 8, pinpointing for the first time the causal locus of ICL task identity. We establish the generality of these findings across four models spanning three architecture families (LLaMA, Qwen, Gemma), discovering a universal intervention window at ∼30% network depth. Causal tracing uncovers an asymmetric architecture: the query position is strictly necessary (53–100% disruption) while no individual demonstration position is necessary (0% disruption)—resolving a key ambiguity in prior accounts. Crucially, transfer depends on internal representation compatibility, not surface similarity (r=−0.05 vs r=0.31), ruling out trivial explanations. These results establish the distributed template hypothesis: ICL task identity is encoded as output format templates distributed across demonstration tokens, reshaping our understanding of how in-context learning operates.
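For concreteness, below is a minimal sketch of the multi-position activation-patching procedure the abstract describes, written with PyTorch forward hooks on a HuggingFace LLaMA-style model. The checkpoint id, layer index, and position bookkeeping are illustrative assumptions, not the paper's released code.

```python
# A minimal sketch of multi-position activation patching, assuming a
# HuggingFace LLaMA-style model. Checkpoint id, layer index, and
# position handling are illustrative, not the paper's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-3B"  # model family named in the abstract
LAYER = 8                          # the ~30%-depth window reported above

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def cache_layer_acts(prompt: str, layer: int) -> torch.Tensor:
    """Run `prompt` once and cache the residual stream after `layer`."""
    store = {}
    def hook(_module, _inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        store["h"] = h.detach()                      # (batch, seq, hidden)
    handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return store["h"]

def patched_generate(prompt: str, donor_acts: torch.Tensor,
                     positions: list[int], layer: int,
                     max_new_tokens: int = 8) -> str:
    """Generate from `prompt` while overwriting activations at every
    index in `positions` (e.g. all demonstration output tokens) with
    a donor task's cached activations. Assumes the donor and target
    prompts tokenize so that `positions` align across the two."""
    def hook(_module, _inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        if h.shape[1] > max(positions):              # prefill pass only
            h[:, positions] = donor_acts[:, positions].to(h.dtype)
    handle = model.model.layers[layer].register_forward_hook(hook)
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=max_new_tokens)
    handle.remove()
    return tok.decode(out[0, ids["input_ids"].shape[1]:])

# Single-position intervention is the positions=[p] special case;
# the multi-position variant passes every demonstration output index.
```

One design point worth noting: routing both interventions through the same hook means the single- vs multi-position contrast in the abstract differs only in the position list, not in the patching machinery.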
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Bryan_Cheng1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 11