['1c1', '< Title: TRAINING ON THE TEST TASK CONFOUNDS EVALUATION AND EMERGENCE', '---', '> Title: Training on the Test Task: A Critical Confounder in LLM Evaluation and Emergence', '3c3', '< Abstract: We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of practices that utilize knowledge about evaluation tasks at training time. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for the effect of training on the test task on benchmark evaluations. Put simply, to fine-tune each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior disappear gradually as models train on the test task. Our work promotes a new perspective on the evaluation of large language models, with broad implications for benchmarking and the study of emergent capabilities. We propose a simple and effective method to adjust for the effect of training on the test task on benchmark evaluations. Put simply, to fine-tune each model on the same, sufficient amount of task-specific data before evaluation. To validate our method, we demonstrate its effectiveness in a controlled setting: we take the older models and fine-tune them on the test task. Remarkably, this recreates the kind of performance differences observed between newer and older models, further suggesting that training on the test task explains the improvements of newer models. 
We then show that we can undo the advantage of the fine-tuned models over the other models by further fine-tuning all models on the test task (Section 3.1, Figure 3). Recent models outperform older ones given the same pretraining compute. We evaluate models on MMLU and GSM8K, and plot benchmark accuracy against pretraining compute in Figure 1 top. We observe that performance correlates with pretraining compute for both benchmarks. However, on the surface it appears that models released after November 2023 better leverage pretraining compute. For a given compute budget, newer models tend to attain better benchmark performance.', '---', '> Abstract: We investigate "training on the test task," a widespread practice in large language model (LLM) development where knowledge about evaluation tasks is utilized during training. Unlike data contamination, this practice is not a malpractice but profoundly impacts evaluation outcomes. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a differing degree of training on the test task. To address this, we propose an effective method to adjust for this effect: fine-tuning each model under comparison on the same, sufficient amount of task-relevant data before evaluation. We show that instances of emergent behavior gradually diminish as models train on the test task. Our work offers a new perspective on LLM evaluation, with broad implications for benchmarking and the study of emergent capabilities. We validate our method by demonstrating that fine-tuning older models on the test task recreates performance differences observed between newer and older models, suggesting this practice explains recent improvements. Further, we show that this advantage can be nullified by fine-tuning all models on the test task (Section 3.1, Figure 3). 
While recent models appear to outperform older ones given the same pretraining compute on benchmarks like MMLU and GSM8K (Figure 1 top), we hypothesize this is largely due to variations in training on the test task.', '6,8c6,10', "< The machine learning community has long recognized certain clear violations of the benchmarking protocol. Training on the test set is the most notorious among them (Duda & Hart, 1973;Hastie et al., 2017). Data leakage (Kapoor & Narayanan, 2022) and data contamination (Roberts et al., 2023;Jiang et al., 2024) are closely related problems linked to the rise of massive web-crawled training datasets. Researchers can all agree that test data should never appear in the training set. But it's been much less clear what to do about legitimate attempts to bring training closer to evaluation. There is an obvious a gap between next token prediction at training time and tasks, such as reasoning and question answering, at test time. Ongoing research and engineering efforts, in fact, aim to narrow precisely this gap (MetaAI, 2024). Why shouldn't training be informed by knowledge about the downstream test tasks? What's an unfair advantage for some may be the feature of others.", '< In this work, we group strategies to utilize task knowledge at training time under the umbrella term of training on the test task. Examples of training on the test task include the use of instructiontuning data or question answering templates during pre-training (Bai et al., 2023;StabilityAI, 2023;Groeneveld et al., 2024). Models may also implicitly train on the test task when their pretraining data is selected through ablations on downstream benchmark evaluations (Gemma et al., 2024;MetaAI, 2024). We work from the premise that training on the test task is acceptable-or, at least, unavoidable.', '< In a nutshell, we show that training on the test task strongly confounds model comparisons across different scales and model families. 
Perhaps counterintuitively, we propose to mitigate the effects of training on the test task on benchmark evaluations by doing more of it. We show that we can effectively level the playing field by giving each model the same, sufficient task-specific fine-tuning before evaluation. This adjustment restores cleaner log-linear scaling and makes capabilities predictable based on much smaller model scales.', '---', '> The machine learning community has long established clear protocols for benchmarking, with "training on the test set" being the most egregious violation (Duda & Hart, 1973; Hastie et al., 2017). Related issues like data leakage (Kapoor & Narayanan, 2022) and data contamination (Roberts et al., 2023; Jiang et al., 2024) have become increasingly relevant with the advent of massive web-crawled training datasets. While there is universal agreement that test data must remain separate from training data, the community faces a less clear challenge regarding legitimate efforts to align training with evaluation objectives. A noticeable gap exists between the general objective of next-token prediction during pre-training and specific downstream tasks like reasoning and question answering at test time. Current research and engineering actively seek to bridge this gap (MetaAI, 2024). This raises a critical question: should training be informed by knowledge of downstream evaluation tasks? What some might view as an unfair advantage, others consider a necessary feature for practical utility.', '> ', '> In this work, we introduce the term "training on the test task" to encompass various strategies that leverage knowledge about evaluation tasks during training. This includes practices such as incorporating instruction-tuning data or question-answering templates into pre-training (Bai et al., 2023; StabilityAI, 2023; Groeneveld et al., 2024). 
Models can also implicitly train on the test task when their pre-training data mixtures are optimized through ablations on downstream benchmark evaluations (Gemma et al., 2024; MetaAI, 2024). We operate under the premise that training on the test task is not only acceptable but, in many modern contexts, unavoidable.', '> ', '> Our core finding is that training on the test task significantly confounds model comparisons across different scales and model families. Counterintuitively, we propose to mitigate these confounding effects on benchmark evaluations by embracing and standardizing the practice. We demonstrate that providing each model with the same, sufficient task-specific fine-tuning before evaluation effectively levels the playing field. This adjustment not only restores cleaner log-linear scaling relationships but also makes model capabilities predictable from much smaller scales.', '16,19c18,20', '< Next, we provide evidence that training on the test task may be a more dominant factor in benchmark performance than data contamination. To argue this point, we consider ARC and HellaSwag. Here, at first, there appears to be no sign of newer models having an unfair advantage over older models. But after reformulating these benchmarks as MMLU-style multiple choice question answering tasks (MCQA), we see the same confounded results as for MMLU (Section 3.2, Figure 4). This suggests that the improvements of newer models on MMLU are likely not because of memorization of specific testing data, but rather due to an improved ability for MCQA tasks.', '< Then, we show how training on the test task distorts model family comparisons. Certain model families appear markedly superior to others before adjusting for test task training, but not after adjustment (Section 4.1, Figure 6). We then demonstrate how training on the test task has inflated the perceived progress made by recent model families. 
After adjusting for its effect, newer models only modestly improve the Pareto frontier of model performance relative to pre-training compute.', '< Finally, we demonstrate that training on the test task has profound implications for the study of emergent capabilities. The phenomenon of emergence disappears gradually as the amount of training on the test task grows (Section 5). Specifically, we can make capabilities visible and predictable from much smaller model scales, recovering cleaner log linear-scaling. Importantly, our adjustment also works in cases, like MMLU, where previous purported explanations of emergence, such as the choice of evaluation metric, do not suffice.', '< Our work calls for a major reorientation of large language model evaluation. Model comparisons and claims of emergence are strongly confounded by the choice of training data relative to the test tasks. When comparing models with different pre-training data, our recommendation is to give each model the same sufficient amount of fine-tuning on task-relevant data before evaluation.', '---', '> Next, we present compelling evidence that "training on the test task" is a more significant driver of benchmark performance than data contamination. We examine ARC Challenge and HellaSwag benchmarks, where initially, newer models show no discernible advantage over older models. However, when these benchmarks are reformulated as MMLU-style multiple-choice question answering (MCQA) tasks, we observe the same confounding effects as seen with MMLU (Section 3.2, Figure 4). This crucial finding indicates that the performance gains of newer models on MMLU are likely not attributable to the memorization of specific test data, but rather to an enhanced proficiency in MCQA tasks.', '> ', '> Furthermore, we illustrate how training on the test task distorts comparisons between different model families. 
Certain families appear significantly superior before adjusting for test task training, but this perceived advantage vanishes after our proposed adjustment (Section 4.1, Figure 6). We also demonstrate that training on the test task has inflated the reported progress in model capabilities over time. After accounting for its effects, newer models show only modest improvements to the Pareto frontier of model performance relative to pre-training compute.', '20a22,25', '> Finally, we reveal the profound implications of training on the test task for the study of emergent capabilities. We show that the phenomenon of emergence gradually disappears as the extent of training on the test task increases (Section 5). Specifically, capabilities become observable and predictable at much smaller model scales, leading to the recovery of cleaner log-linear scaling. Importantly, our adjustment proves effective even in cases, such as MMLU, where previous explanations for emergence (e.g., choice of evaluation metric) are insufficient.', '> ', '> Our work advocates for a fundamental reorientation of large language model evaluation. We assert that model comparisons and claims of emergence are heavily confounded by the relationship between training data and test tasks. When comparing models with diverse pre-training data, we recommend standardizing evaluation by providing each model with the same, sufficient amount of task-relevant fine-tuning before assessment.', '> ', '411d415', '< ']
