Keywords: Language models, in-context learning, reasoning, interpretability
TL;DR: If a shallow auxiliary prediction head struggles to approximate the full model's next-token prediction, we can infer that the model is performing complex in-context computation.
Abstract: Measuring the in-context computational effort of language models is a key challenge, as standard metrics like next-token loss fail to capture the complexity of the underlying reasoning. Prior work based on latent state compression is promising but can be invasive and unstable to train. In this paper, we propose Multiple Token Divergence (MTD), a simple and direct measure of computational effort that quantifies the KL divergence between the full model's output distribution and that of a shallow, auxiliary prediction head. An MTD module can easily be inserted into a language model. Alternatively, pre-trained multiple token prediction heads that are included in some state-of-the-art models can be used directly, requiring no further training. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, we find that MTD correlates positively with problem difficulty—in direct contrast to next-token loss—and that lower MTD is associated with more accurate self-generated reasoning. MTD provides a practical, lightweight tool for analyzing and understanding the computational dynamics of language models.
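To make the measure concrete, below is a minimal sketch of how MTD might be computed, assuming we already have logits from the full model and from a shallow auxiliary head at each position. The function name `mtd`, the direction of the KL divergence, and the averaging over positions are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def mtd(full_logits: torch.Tensor, head_logits: torch.Tensor) -> torch.Tensor:
    """Sketch of Multiple Token Divergence (assumed formulation).

    full_logits, head_logits: (seq_len, vocab_size) logits from the full model
    and from the shallow auxiliary prediction head, respectively.
    Returns the per-position KL(full || head), averaged over the sequence.
    """
    log_p = F.log_softmax(full_logits, dim=-1)   # full model's output distribution (log)
    log_q = F.log_softmax(head_logits, dim=-1)   # shallow head's distribution (log)
    # KL(p || q) at each position; high values mean the shallow head
    # struggles to approximate the full model's prediction.
    kl = torch.sum(log_p.exp() * (log_p - log_q), dim=-1)
    return kl.mean()
```

Under this reading, a high MTD indicates that the full model's prediction relies on computation the shallow head cannot reproduce, which is the signal of in-context effort the abstract describes.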
Submission Number: 202