What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?
Keywords: In-Context Learning, Language Modeling, Induction Circuits
Abstract: Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient, when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate the emergence of induction heads and enhance ICL performance, we introduce $\textit{Bi-Induct}$, a lightweight curriculum that injects forward-copy sequences ($\textit{Induction}$), backward-copy sequences ($\textit{Anti}$, a control), or a balanced mix of the two into the pretraining stream. We conduct iso-FLOPs pretraining across models from 0.13B to 1B parameters, evaluating effects along three axes: (i) few-shot performance on ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language-modeling perplexity.
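For concreteness, a minimal sketch of what such synthetic copy examples could look like, assuming a simple pattern-filler-repeat layout; the vocabulary size, sequence length, mixing rate, and function names are illustrative assumptions, not the paper's exact recipe:

```python
import random

def make_copy_sequence(vocab_size=1000, pattern_len=8, seq_len=64, forward=True, seed=None):
    """Build one synthetic example: a random token pattern, filler tokens, then the
    same pattern repeated verbatim (forward-copy / "Induction") or reversed
    (backward-copy / "Anti"). All sizes here are illustrative assumptions."""
    rng = random.Random(seed)
    pattern = [rng.randrange(vocab_size) for _ in range(pattern_len)]
    filler = [rng.randrange(vocab_size) for _ in range(seq_len - 2 * pattern_len)]
    repeat = pattern if forward else pattern[::-1]
    return pattern + filler + repeat

def mixed_stream(natural_batches, mix_rate=0.05, forward=True):
    """Interleave synthetic copy examples into a natural-text token stream at a
    small rate (the mixing rate is an assumption, not the paper's setting)."""
    for batch in natural_batches:
        if random.random() < mix_rate:
            yield make_copy_sequence(forward=forward)
        else:
            yield batch
```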
Our findings challenge the intuition that early induction circuit activation directly translates to better ICL. While Bi-Induct accelerates induction head emergence at smaller scales, this does not consistently yield better few-shot generalization. On standard LM benchmarks, Bi-Induct matches natural-only training; on function-style ICL probes, the 1B natural-only model performs best. These trends persist under stress tests (e.g., label permutation, HITS@1 vs. HITS@3, 1 vs. 10 shots).
Telemetry reveals that larger models trained only on natural text develop broader and earlier-peaking induction heads, despite seeing no explicit induction patterns. Anti-induction data fails to elicit meaningful activation. Perplexity penalties from synthetic data shrink with scale, suggesting that larger models can absorb non-natural patterns with minimal cost.
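The head-level telemetry referenced above is, in spirit, a prefix-matching diagnostic. Below is a minimal sketch, assuming access to per-head attention matrices; the scoring rule and function name are assumptions rather than the paper's exact metric.

```python
import torch

def induction_score(attn, tokens):
    """Prefix-matching score for one attention head: the average attention mass a
    query token places on positions immediately following an earlier occurrence of
    the same token (the canonical induction pattern). `attn` is a [seq, seq]
    (query x key) matrix for a single head; `tokens` is a [seq] tensor of token ids.
    Illustrative diagnostic, not necessarily the paper's exact telemetry."""
    seq_len = tokens.shape[0]
    score, count = 0.0, 0
    for q in range(1, seq_len):
        # earlier positions whose token matches the current query token
        prev_match = (tokens[:q] == tokens[q]).nonzero().flatten()
        # the induction target is the position right after each earlier match
        targets = prev_match + 1
        targets = targets[targets < q]
        if len(targets) > 0:
            score += attn[q, targets].sum().item()
            count += 1
    return score / max(count, 1)
```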
Crucially, ablating the top 2% of induction heads per layer degrades ICL more than random ablations, especially for natural-only models, indicating more centralized, load-bearing circuits. Bi-Induct variants exhibit more redundant induction activity, pointing to different circuit utilization patterns.
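A minimal sketch of the head-selection step behind such an ablation comparison, assuming heads have already been scored (e.g., with a diagnostic like the one above); the per-layer top-2% selection mirrors the text, while the dictionary layout and the size-matched random control are illustrative assumptions:

```python
import random

def select_heads_to_ablate(head_scores, frac=0.02, randomize=False):
    """head_scores: dict {(layer, head): induction score}. Returns the set of heads
    to ablate: either the top `frac` by score within each layer, or a size-matched
    random set per layer (the control). Granularity and tie-breaking are assumptions."""
    layers = sorted({layer for layer, _ in head_scores})
    chosen = []
    for layer in layers:
        heads = [(h, s) for (l, h), s in head_scores.items() if l == layer]
        k = max(1, int(round(frac * len(heads))))
        if randomize:
            chosen += [(layer, h) for h, _ in random.sample(heads, k)]
        else:
            heads.sort(key=lambda hs: hs[1], reverse=True)
            chosen += [(layer, h) for h, _ in heads[:k]]
    return set(chosen)
```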
Overall, we find that inducing activation is not sufficient: improvements in ICL hinge on whether these circuits become functionally necessary. These results underscore the importance of mechanism-aware pretraining diagnostics and data mixtures that foster $\textit{load-bearing}$, not merely present, structure.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14373