Dissecting In-Context Learning: A Mechanistic Analysis of Emergent Circuits in Small Language Models
Keywords: in-context learning, mechanistic interpretability, language models, circuits, transformers
TL;DR: We dissect how small transformer models implement in-context learning by identifying four causal circuit types and showing their consistency across scales.
Abstract: In-context learning (ICL) enables language models to adapt to new tasks from a few examples, yet the mechanistic basis of this capability remains poorly understood. We present a comprehensive analysis of the circuits underlying ICL in transformer models ranging from 125M to 1.3B parameters. Through systematic interventions and causal analysis, we identify four distinct circuit types that emerge during training: copy circuits that replicate patterns, induction circuits that abstract rules, composition circuits that combine information, and task recognition circuits that identify problem types. We demonstrate that these circuits are (1) causally responsible for ICL performance, with targeted ablations degrading performance by 73% on average, (2) transferable across model scales, with a 0.82 correlation in circuit structure, and (3) surgically enhanceable, yielding a 28% improvement on targeted tasks. Our analysis reveals that ICL emerges through the coordinated interaction of 12–15 critical attention heads that form interpretable computational graphs. We provide an open-source toolkit for ICL circuit analysis and demonstrate applications to model debugging and capability enhancement. These findings offer actionable insights for improving model interpretability and engineering more capable systems.
Primary Area: interpretability and explainable AI
Submission Number: 2843
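Illustrative sketch: the targeted ablations mentioned in the abstract can be pictured as zeroing the output of selected attention heads and measuring how much the layer output changes. The NumPy toy below is only a sketch under stated assumptions, not the authors' toolkit or method: the layer is randomly initialized, the names `multi_head_attention`, `ablation_effect`, and `head_mask` are hypothetical, zero-ablation is one of several possible ablation choices, and the relative output change stands in for the task-level performance metric the abstract reports.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask=None):
    """Causal multi-head self-attention with optional per-head zero-ablation.

    x: (seq, d_model); w_q, w_k, w_v: (n_heads, d_model, d_head);
    w_o: (n_heads, d_head, d_model); head_mask: (n_heads,) with 1 = keep, 0 = ablate.
    """
    n_heads, _, d_head = w_q.shape
    seq = x.shape[0]
    if head_mask is None:
        head_mask = np.ones(n_heads)
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(d_head)
    # Causal mask: position s may not attend to later positions t > s.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    attn = softmax(scores, axis=-1)
    per_head = np.einsum("hst,hte->hse", attn, v)
    # Each head's separate contribution to the layer output.
    per_head_out = np.einsum("hse,hed->hsd", per_head, w_o)
    # Zero-ablation: drop the contributions of masked heads before summing.
    return (per_head_out * head_mask[:, None, None]).sum(axis=0)

def ablation_effect(x, weights, heads_to_ablate):
    """Relative change in layer output when the given heads are zero-ablated."""
    n_heads = weights[0].shape[0]
    mask = np.ones(n_heads)
    mask[np.array(list(heads_to_ablate), dtype=int)] = 0.0
    full = multi_head_attention(x, *weights)
    ablated = multi_head_attention(x, *weights, head_mask=mask)
    return np.linalg.norm(full - ablated) / np.linalg.norm(full)

# Toy setup (assumed sizes, random weights; not a trained model).
rng = np.random.default_rng(0)
n_heads, d_model, d_head, seq = 4, 16, 4, 6
weights = (
    rng.normal(size=(n_heads, d_model, d_head)),  # w_q
    rng.normal(size=(n_heads, d_model, d_head)),  # w_k
    rng.normal(size=(n_heads, d_model, d_head)),  # w_v
    rng.normal(size=(n_heads, d_head, d_model)),  # w_o
)
x = rng.normal(size=(seq, d_model))

for h in range(n_heads):
    print(f"head {h}: relative output change = {ablation_effect(x, weights, [h]):.3f}")
```

In a real analysis the same masking idea would be applied to a trained model, and the effect would be scored on ICL task accuracy rather than on raw output change; ranking heads by that score is one way a small set of critical heads could be singled out.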