Keywords: Pretraining
TL;DR: We study LLM pretraining with logit distillation and show that it comes with a tradeoff: it boosts test-time scaling but impairs in-context learning. We analyze this tradeoff in detail and develop mitigation strategies.
Abstract: In the past year, distillation has seen renewed prominence in large language model (LLM) pretraining, exemplified by the Llama-3.2 and Gemma model families. While distillation has historically been shown to improve statistical modeling, its effects on paradigms central to modern LLMs, such as test-time scaling and in-context learning, remain underexplored. In this work, we make three main contributions. First, we show that pretraining with distillation yields models with remarkably better test-time scaling. Second, we observe that this benefit comes with a trade-off: distillation impairs in-context learning, particularly the kind implemented by induction heads. Third, to demystify these findings, we study distilled pretraining in a sandbox based on a bigram model, which helps us isolate the principal factor common to our observations. Finally, using these insights, we shed light on various pretraining design choices that should help practitioners going forward.
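For readers unfamiliar with the setup, the following is a minimal sketch of what a logit-distillation pretraining objective typically looks like; it is not the paper's implementation, and the mixing weight `alpha` and temperature `T` are illustrative assumptions.

import torch
import torch.nn.functional as F

def distilled_pretraining_loss(student_logits, teacher_logits, targets, alpha=0.5, T=1.0):
    """Blend next-token cross-entropy with a KL term toward the teacher's logits."""
    vocab = student_logits.size(-1)
    # Standard next-token prediction loss against the ground-truth tokens.
    ce = F.cross_entropy(student_logits.view(-1, vocab), targets.view(-1))
    # KL divergence between softened teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * ce + alpha * kl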
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6313