Title: Training on the Test Task: A Critical Confounder in LLM Evaluation and Emergence

Abstract: We investigate "training on the test task," a widespread practice in large language model (LLM) development where knowledge about evaluation tasks is utilized during training. Unlike data contamination, this practice is not a malpractice but profoundly impacts evaluation outcomes. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a differing degree of training on the test task. To address this, we propose an effective method to adjust for this effect: fine-tuning each model under comparison on the same, sufficient amount of task-relevant data before evaluation. We show that instances of emergent behavior gradually diminish as models train on the test task. Our work offers a new perspective on LLM evaluation, with broad implications for benchmarking and the study of emergent capabilities. We validate our method by demonstrating that fine-tuning older models on the test task recreates performance differences observed between newer and older models, suggesting this practice explains recent improvements. Further, we show that this advantage can be nullified by fine-tuning all models on the test task (Section 3.1, Figure 3). While recent models appear to outperform older ones given the same pretraining compute on benchmarks like MMLU and GSM8K (Figure 1 top), we hypothesize this is largely due to variations in training on the test task.

Section: INTRODUCTION
The machine learning community has long established clear protocols for benchmarking, with "training on the test set" being the most egregious violation (Duda & Hart, 1973; Hastie et al., 2017). Related issues like data leakage (Kapoor & Narayanan, 2022) and data contamination (Roberts et al., 2023; Jiang et al., 2024) have become increasingly relevant with the advent of massive web-crawled training datasets. While there is universal agreement that test data must remain separate from training data, the community faces a less clear challenge regarding legitimate efforts to align training with evaluation objectives. A noticeable gap exists between the general objective of next-token prediction during pre-training and specific downstream tasks like reasoning and question answering at test time. Current research and engineering actively seek to bridge this gap (MetaAI, 2024). This raises a critical question: should training be informed by knowledge of downstream evaluation tasks? What some might view as an unfair advantage, others consider a necessary feature for practical utility.

In this work, we introduce the term "training on the test task" to encompass various strategies that leverage knowledge about evaluation tasks during training. This includes practices such as incorporating instruction-tuning data or question-answering templates into pre-training (Bai et al., 2023; StabilityAI, 2023; Groeneveld et al., 2024). Models can also implicitly train on the test task when their pre-training data mixtures are optimized through ablations on downstream benchmark evaluations (Gemma et al., 2024; MetaAI, 2024). We operate under the premise that training on the test task is not only acceptable but, in many modern contexts, unavoidable.

Our core finding is that training on the test task significantly confounds model comparisons across different scales and model families. Counterintuitively, we propose to mitigate these confounding effects on benchmark evaluations by embracing and standardizing the practice. We demonstrate that providing each model with the same, sufficient task-specific fine-tuning before evaluation effectively levels the playing field. This adjustment not only restores cleaner log-linear scaling relationships but also makes model capabilities predictable from much smaller scales.

Section: OUR CONTRIBUTIONS
We introduce the term training on the test task to group a growing repertoire of practices that utilize knowledge about evaluation tasks at training time. We study its impact on present-day benchmark evaluations by critically examining the performance improvements of recent language models. Our analysis spans 56 different language models and two major active benchmarks, MMLU and GSM8K.
Figure 1: MMLU and GSM8K scores of 56 base models, with model sizes ranging from 70M to 70B parameters. Panel titles: (Top) "Base models trained after November 2023 outperform those trained before November 2023"; (Bottom) "After fine-tuning all models on the test task, differences in model performance vanish". Legend: models trained before November 2023, models trained after November 2023. Solid lines correspond to the regression fit of $A = \alpha \max(0, \log C - c_e) + \theta N + r$, where A is accuracy, C is pretraining compute, N is whether the model was trained after November 2023, and r is random chance accuracy. The coefficient θ denotes the average improvement of models trained after November 2023 when controlling for pretraining compute. Bold indicates statistical significance with p-value < 0.05. (Top) We hypothesize that training on the test task confounds benchmark evaluations, resulting in newer base models substantially outperforming older ones. (Bottom) We propose to adjust for differences in test task training by fine-tuning all models on the same, sufficient amount of task-specific data before evaluation. After fine-tuning on the test task, differences in benchmark performance between older and newer models vanish.
We start in Section 2 by dividing models into those trained before November 2023 and those trained after. We find that for the same amount of pretraining compute, newer models strongly outperform older ones, on average by 7 percentage points in MMLU and 19 percentage points in GSM8K. We then fine-tune all models on the same amount of task-specific data before evaluation. After fine-tuning on the same task data, newer models no longer outperform older ones. Rather, their performance equalizes. See Figure 1. This outcome suggests that the main difference between newer and older models is the extent to which they train on the test task.
Next, we present compelling evidence that "training on the test task" is a more significant driver of benchmark performance than data contamination. We examine ARC Challenge and HellaSwag benchmarks, where initially, newer models show no discernible advantage over older models. However, when these benchmarks are reformulated as MMLU-style multiple-choice question answering (MCQA) tasks, we observe the same confounding effects as seen with MMLU (Section 3.2, Figure 4). This crucial finding indicates that the performance gains of newer models on MMLU are likely not attributable to the memorization of specific test data, but rather to an enhanced proficiency in MCQA tasks.

Furthermore, we illustrate how training on the test task distorts comparisons between different model families. Certain families appear significantly superior before adjusting for test task training, but this perceived advantage vanishes after our proposed adjustment (Section 4.1, Figure 6). We also demonstrate that training on the test task has inflated the reported progress in model capabilities over time. After accounting for its effects, newer models show only modest improvements to the Pareto frontier of model performance relative to pre-training compute.

Finally, we reveal the profound implications of training on the test task for the study of emergent capabilities. We show that the phenomenon of emergence gradually disappears as the extent of training on the test task increases (Section 5). Specifically, capabilities become observable and predictable at much smaller model scales, leading to the recovery of cleaner log-linear scaling. Importantly, our adjustment proves effective even in cases, such as MMLU, where previous explanations for emergence (e.g., choice of evaluation metric) are insufficient.

Our work advocates for a fundamental reorientation of large language model evaluation. We assert that model comparisons and claims of emergence are heavily confounded by the relationship between training data and test tasks. When comparing models with diverse pre-training data, we recommend standardizing evaluation by providing each model with the same, sufficient amount of task-relevant fine-tuning before assessment.

Section: ADJUSTING FOR TRAINING ON THE TEST TASK
We choose MMLU (Hendrycks et al., 2020) and GSM8K (Cobbe et al., 2021) as a case study for investigating training on the test task in active benchmarks. MMLU tests for world knowledge, whereas GSM8K tests multistep mathematical reasoning. These two benchmarks are arguably the most influential of the 2022-2024 period under study. They are also included in the Hugging Face (HF) Open LLM Leaderboard v1 (Beeching et al., 2023), a popular leaderboard that evaluates and ranks models with publicly available weights. We evaluate models using LM Evaluation Harness (EleutherAI, 2024), in identical fashion to the HF leaderboard.
We evaluate 56 base models, ranging in size from 70M to 70B parameters. See Appendix B.1 for the full list. The HF leaderboard's FAQ makes the distinction between "base pretrained models" and instruction-tuned or chat models, arguing that this is necessary to ensure fair model comparisons.
We select models that are categorized as "pretrained". We check that the technical report of each of the selected models makes no mention of the model being fine-tuned. We only consider models for which the number of training tokens is known. This allows us to estimate the total amount of pretraining compute in FLOPs as $C \approx 6 \cdot N \cdot D$, where C is pretraining compute, N is the number of model parameters, and D is the number of training tokens.
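As a rough illustration of the compute estimate above, the following sketch evaluates $C \approx 6 \cdot N \cdot D$ for two models. The function name `pretraining_flops` is hypothetical, and the parameter and token counts are rounded, publicly reported figures used here as assumptions; they are not taken from this paper's appendix.

```python
def pretraining_flops(n_params: float, n_tokens: float) -> float:
    """Approximate pretraining compute in FLOPs as C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


# Illustrative inputs only (assumed, rounded parameter and token counts).
examples = {
    "Pythia 6.9B, ~300B tokens": (6.9e9, 300e9),
    "Llama 2 70B, ~2T tokens": (70e9, 2e12),
}

for name, (n_params, n_tokens) in examples.items():
    print(f"{name}: {pretraining_flops(n_params, n_tokens):.2e} FLOPs")
```

Under these assumed inputs, the first estimate lands on the order of 10^22 FLOPs and the second slightly below 10^24 FLOPs.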
While we focus primarily on MMLU and GSM8K due to their prominence, we find that the issue of training on the test task extends beyond these two benchmarks. Specifically, in Appendix E, we evaluate and discuss the impact of training on the test task for five additional benchmarks: MMLU Pro (Wang et al., 2024), GPQA (Rein et al., 2023), BBH (Suzgun et al., 2023), MuSR (Sprague et al., 2023), and MATH Level 5 (Hendrycks et al., 2021), which form the OpenLLM Leaderboard v2 (Fourrier et al., 2024a). Furthermore, in Appendix G we conduct similar experiments for an additional 36 instruction and chat models. We observe that our findings generalize remarkably well to instruction and chat models.
We selected November 2023 as the temporal cutoff for our analysis because the technical reports of models released from late 2023 onward start referencing certain pre-training practices that may amount to training on the test task. For example, Qwen (Bai et al., 2023), Olmo 1.7 (Groeneveld et al., 2024) and MAP Neo (Zhang et al., 2024) explicitly include instruction data during pretraining. StableLM 2 (StabilityAI, 2023) reformulates some of its pretraining datasets to better resemble downstream tasks such as question-answering. More subtly, the pretraining data mixtures of Gemma (Gemma et al., 2024) and Llama 3 (MetaAI, 2024) were determined through extensive ablations on downstream benchmark evaluations. We validate that our findings are robust to adjusting the temporal cutoff by a few months; see Appendix D.1 for details. Choosing specifically the month of November as the cutoff is therefore not critical for our analysis.
This raises an important question: Do newer models outperform older ones mainly because newer models trained more on the test task? At first sight, an answer seems elusive. After all, the pretraining data of most recent models is not publicly available. Retraining all models with the same training data and compute budget would be both infeasible and cost prohibitive. In the next section, we propose a way to get at the answer by adjusting for the effect of training on the test task.

Section: ADJUSTING FOR TRAINING ON THE TEST TASK BY TRAINING ON THE TEST TASK
We propose to adjust for differences in test task training by fine-tuning all models on the same, sufficient amount of task-specific data before evaluation. To do so, we need a source of task-specific data for each of the tasks we consider. For multiple-choice question answering, we use the auxiliary training set accompanying the HF MMLU repository. This training set is not an i.i.d. split of MMLU. Instead, it consists of the training sets from other multiple-choice question-answering benchmarks, comprising approximately 100,000 training examples and 30 million tokens. For mathematical reasoning, we combine MetaMathQA (Yu et al., 2023b) and Orca-Math (Mitra et al., 2024), totalling approximately 600,000 training examples and 200 million tokens. We fine-tune models for three epochs using standard hyperparameter choices (see Appendix B.2). The amount of compute required for fine-tuning is minimal compared to models' pretraining compute.
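To give a sense of scale for the last claim, the sketch below compares fine-tuning compute to pretraining compute. It assumes that the $6 \cdot N \cdot D$ approximation also applies to full fine-tuning, in which case the parameter count cancels and only token counts matter. The function name and the 2-trillion-token pretraining budget are hypothetical, chosen for illustration only.

```python
def finetune_to_pretrain_ratio(finetune_tokens: float, epochs: int,
                               pretrain_tokens: float) -> float:
    """Ratio of fine-tuning to pretraining compute.

    Assumes C ~ 6 * N * D for both phases, so the parameter count N cancels.
    """
    return (finetune_tokens * epochs) / pretrain_tokens


# 200M tokens of math data for three epochs (figures from the text above),
# against a hypothetical model pretrained on 2T tokens.
ratio = finetune_to_pretrain_ratio(200e6, epochs=3, pretrain_tokens=2e12)
print(f"fine-tuning / pretraining compute: {ratio:.4%}")
```

Under these assumptions the fine-tuning run costs a few hundredths of a percent of the pretraining budget. The ratio grows for models pretrained on fewer tokens.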
We plot model scores on MMLU and GSM8K after fine-tuning in Figure 1 (bottom). We observe that after fine-tuning on task-relevant data, both newer and older models follow remarkably similar scaling trends. That is, newer models no longer appear to outperform older models.
Remarkably, we observe that older models tend to benefit much more from fine-tuning on task-relevant data than newer models do (see Figure 2). The improvements in older models are striking, often leaping from random chance accuracy to double-digit gains in accuracy. In contrast, fine-tuning provides comparatively little benefit to newer models. This observation suggests that newer models have already been exposed to a substantial amount of task-relevant data, making additional fine-tuning less impactful.
A potential concern is that our observations might result from our fine-tuning hyperparameters being systematically more favorable to older models. We verify that this is not the case by conducting a robustness check on the fine-tuning hyperparameters, see Appendix B.3.

Section: QUANTIFYING PERFORMANCE DIFFERENCES BETWEEN NEWER AND OLDER MODELS
We draw inspiration from scaling laws (Kaplan et al., 2020) in how we model benchmark accuracy A to scale log-linearly with pretraining compute C. To account for emergence (Wei et al., 2022b), we assume that models perform at the task's random chance accuracy r up to scaling to some point of emergence $c_e$. We let the variable N denote whether a model was trained after November 2023, and regress the model
$$A = \alpha \max(0, \log C - c_e) + \theta N + r + \epsilon, \tag{1}$$
where α, θ and $c_e$ are the fit's parameters, and ϵ is random noise. We focus on the coefficient θ, which corresponds to the average difference in benchmark performance between newer and older models after controlling for pretraining compute. We fit the model in Equation 1, and report the regression coefficient θ in Figure 1. We obtain $R^2 > 0.9$ for all model fits. We use clustered standard errors to compute statistical significance, treating each model family as a separate group.
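One way to fit a model of the form in Equation 1 is sketched below. This is a minimal illustration and not necessarily the fitting procedure used for the reported results: it assumes a grid search over $c_e$ with ordinary least squares for α and θ, uses base-10 logarithms of compute, runs on synthetic data, and omits the clustered standard errors. The function name `fit_emergence_model` is hypothetical.

```python
import numpy as np


def fit_emergence_model(log_c, acc, newer, r, n_grid=200):
    """Fit A = alpha * max(0, log C - c_e) + theta * N + r by least squares.

    For each candidate c_e on a grid, alpha and theta are solved in closed
    form; the candidate with the smallest squared error is returned.
    """
    log_c, acc, newer = (np.asarray(x, dtype=float) for x in (log_c, acc, newer))
    best = None
    for c_e in np.linspace(log_c.min(), log_c.max(), n_grid):
        # Design matrix: hinge feature in log-compute, and the "newer" indicator.
        X = np.column_stack([np.maximum(0.0, log_c - c_e), newer])
        coef, *_ = np.linalg.lstsq(X, acc - r, rcond=None)
        sse = float(np.sum((X @ coef + r - acc) ** 2))
        if best is None or sse < best["sse"]:
            best = {"alpha": float(coef[0]), "theta": float(coef[1]),
                    "c_e": float(c_e), "sse": sse}
    return best


# Synthetic data (not from the paper): true alpha=0.10, theta=0.07, c_e=21.5.
rng = np.random.default_rng(0)
log_c = rng.uniform(20.0, 24.0, size=80)           # log10 of pretraining FLOPs
newer = rng.integers(0, 2, size=80).astype(float)  # 1 if trained after the cutoff
acc = (0.25 + 0.10 * np.maximum(0.0, log_c - 21.5) + 0.07 * newer
       + rng.normal(0.0, 0.01, size=80))

fit = fit_emergence_model(log_c, acc, newer, r=0.25)
print(fit)
```

On this synthetic example the recovered θ should be close to the planted value of 0.07; setting the `newer` column to zeros reduces the fit to estimating α and $c_e$ only.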
Before adjusting for test task training, the estimated difference in performance θ between newer and older models is statistically significant, positive, and large. Specifically, recent models outperform older ones on average by over 7 accuracy points in MMLU and 19 accuracy points in GSM8K. These are remarkably large differences in benchmark performance. However, after the adjustment, the estimated coefficient θ is both small and not statistically significant. See Figure 1 bottom. That is, conditioned on all models training on the same amount of task-specific data, we find no evidence for a significant difference in benchmark performance between newer and older models.
Therefore, the performance of newer and older models equalizes when all models are exposed to the same amount of task-relevant data. This suggests that the impressive benchmark improvements of newer models are primarily attributable to newer models training more on the test task. We present a causal interpretation of results in Appendix C, outlining the assumptions necessary to establish a causal link between training on the test task and the benchmark improvements of newer models.

Section: RECREATING DIFFERENCES IN BENCHMARK PERFORMANCE
We have so far established that newer models strongly outperform older models for the same amount of pre-training compute. We now demonstrate how to recreate such differences in performance by actively manipulating how much models train on the test task. We do so in two ways. First, we fine-tune older models on task-relevant data (Section 3.1). Second, we reformulate certain test tasks to use MMLU-style multiple-choice prompts instead of "cloze" evaluations (Section 3.2). Both experiments recreate the kind of performance differences observed between newer and older models.
These results provide further evidence that the differences in performance between older and newer models are linked to test task training. They also demonstrate how test task training distorts benchmark evaluations. Fortunately, in both cases, we show that fine-tuning models on task-relevant data before evaluation is an effective mechanism for mitigating the bias introduced by training on the test task. In doing so, we systematically validate the proposed adjustment method.

Section: FINE-TUNING ON THE TEST TASK
For this section, we only consider models trained before November 2023. We split the models into two cohorts: a control group and a treatment group. We take these older models as a control group.
We then create a treatment group by fine-tuning the control group on the datasets of task-relevant data introduced in Section 2. We only fine-tune models with at least $7 \cdot 10^{21}$ FLOPs, the pre-training compute of the smallest newer model. We fine-tune for a single epoch. We plot in Figure 3 top the benchmark performance of the two cohorts.
Qualitatively, the differences in performance between the control and treatment groups resemble the differences observed between newer and older models (compare Figure 3 with Figure 1). Quantitatively, the estimated performance gain θ from fine-tuning is similar to the difference between newer and older models estimated in Section 2.2. That is, fine-tuning older models on the test task produces both qualitatively and quantitatively similar confounding to that observed between newer and older models. These results further support our running hypothesis that newer models are largely equivalent to older models that have trained on the test task. They also demonstrate the large effect that training on the test task can have on benchmark performance. Note that the gain in performance of the treatment group is slightly larger than the difference in performance between newer and older models. This is to be expected, as all models in the treatment group are fine-tuned on the test task, whereas not all new models may train on the test task.
We then apply our proposed adjustment by further fine-tuning both the control and treatment groups on the test task, see Figure 3 right. After the adjustment, the estimated difference in performance θ between the control and treatment group is both small and not statistically significant. We therefore validate a vital soundness property of the proposed adjustment procedure: after deliberately training some models on the test task, we can undo their advantage over other models by further training all models on the test task.

Section: REFORMULATING THE TEST TASK
In this section, we show that reformulating other benchmarks as multiple-choice question answering tasks leads to similar differences in performance between older and newer models. We consider two additional benchmarks from the HF leaderboard v1: ARC Challenge (Clark et al., 2018) and HellaSwag (Zellers et al., 2019). Similarly to MMLU, ARC comprises grade-school level questions. HellaSwag instead tests for commonsense reasoning. Like MMLU, the questions in ARC and HellaSwag are accompanied by four possible answers, among which the model must identify the correct one. ARC and HellaSwag use "cloze" evaluations: a model's answer is taken to be that with the largest completion likelihood given the input question. In contrast, MMLU formulates questions as multiple-choice: all four answer choices are listed, and the model is prompted to pick one.
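The difference between the two evaluation formats can be sketched as follows. This is a simplified illustration, not the LM Evaluation Harness implementation: the prompt templates are assumptions, `loglik` stands in for a hypothetical function returning the model's log-likelihood of a continuation given a context, and refinements such as length normalization are omitted.

```python
from typing import Callable, Sequence

LETTERS = "ABCD"
LogLik = Callable[[str, str], float]  # (context, continuation) -> log-likelihood


def cloze_predict(question: str, choices: Sequence[str], loglik: LogLik) -> int:
    """Cloze: score each answer text as a continuation of the bare question."""
    context = f"Question: {question}\nAnswer:"
    scores = [loglik(context, f" {choice}") for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def mcqa_prompt(question: str, choices: Sequence[str]) -> str:
    """MMLU-style prompt: all answer options are listed with letter labels."""
    lines = [question] + [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    return "\n".join(lines) + "\nAnswer:"


def mcqa_predict(question: str, choices: Sequence[str], loglik: LogLik) -> int:
    """Multiple choice: score only the letter labels after the full prompt."""
    prompt = mcqa_prompt(question, choices)
    scores = [loglik(prompt, f" {letter}") for letter in LETTERS[: len(choices)]]
    return max(range(len(choices)), key=scores.__getitem__)


# Toy stand-in for a language model, for demonstration only.
def toy_loglik(context: str, continuation: str) -> float:
    return 0.0 if continuation.strip() in {"Paris", "B"} else -5.0


question = "What is the capital of France?"
choices = ["London", "Paris", "Rome", "Berlin"]
print(mcqa_prompt(question, choices))
print(cloze_predict(question, choices, toy_loglik), mcqa_predict(question, choices, toy_loglik))
```

In the cloze format the model never sees the competing options, whereas in the multiple-choice format it must map its answer onto a letter label, which is a skill that can itself be trained.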
We first evaluate all models on ARC and HellaSwag using the standard cloze evaluation, and plot their benchmark performance in Figure 4 left. We repeat the statistical analysis of Section 2.2. We find that the estimated difference in performance θ between newer and older models is small and not statistically significant: newer models do not outperform older models on ARC and HellaSwag.
We then reformulate ARC and HellaSwag as MMLU-style multiple-choice questions, and plot the resulting benchmark performance in Figure 4 center. We observe large differences in performance between newer and older models. Specifically, we find the difference in performance θ between newer and older models to be significant, positive, and large, and to be roughly similar in magnitude to that estimated for MMLU in Section 2.2. That is, reformulating the test task as multiple-choice question answering leads to qualitatively and quantitatively similar confounding to that observed for MMLU. Therefore, newer models likely outperform older ones on MMLU not because of memorization of specific testing data (i.e., due to data contamination or leakage), but rather due to an improved ability for multiple-choice question answering.
Lastly, we adjust for test task training by fine-tuning all models on the MMLU auxiliary training set, and plot their ARC Challenge and HellaSwag scores in Figure 4 right. We no longer find evidence of a large or significant difference in performance between newer and older models. Therefore, the proposed adjustment is effective in mitigating the bias introduced by evaluating models via multiple-choice question answering. Notably, performance on ARC and HellaSwag equalizes after fine-tuning on the MMLU auxiliary training set. This indicates that the adjustment data need not closely resemble the test set, but rather the test task.
What does MMLU test for? We evaluate MMLU using the "cloze" methodology instead of the usual multiple-choice prompts. We plot the results in Figure 5 center. With cloze evaluations, the difference in performance between newer and older models is both small and not statistically significant. This suggests that the standard MMLU evaluation conflates knowledge-testing with testing a model's ability to answer multiple-choice questions. Newer models therefore attain higher MMLU scores than older models largely because they are better at multiple-choice question answering, and not because they necessarily "know more".

Section: IMPLICATIONS FOR MODEL COMPARISONS
So far, we have shown how training on the test task distorts benchmark evaluations. Next, we examine its impact on the relative comparison of model families (Section 4.1) as well as its implications for accurately measuring progress in model capabilities over time (Section 4.2).

Section: COMPARING MODEL FAMILIES
We compare the performance of the Pythia, Llama 2, and Qwen 1.5 model families, which likely train on the test task to very different extents. Pythia was trained on the Pile (Gao et al., 2020), a collection of curated datasets that are unlikely to contain much test task data. Llama 2 was trained mostly on web data, which can reasonably be assumed to contain more test task data. Lastly, Qwen 1.5 explicitly includes instruction data in its pretraining mixture, thus likely training on the test task.
We plot the MMLU and GSM8K scores of the three model families in Figure 6, as well as their adjusted scores (i.e., after fine-tuning on task-relevant data). Without adjustment, Qwen 1.5 appears to be the superior model family: it Pareto dominates both the Llama 2 and Pythia models. In contrast, all Pythia models perform no better than random chance, making it unclear whether scaling Pythia offers any benefit at all. After adjustment, however, all three model families exhibit remarkably similar scaling trends. Therefore, after correcting for the confounding of test task training, none of the model families appears superior to the others.
Training on the test task therefore profoundly confounds relative model comparisons. Base models are rarely used "as is" and are generally adapted before deployment. Because of the confounding of training on the test task, performance before adaptation may not reliably predict performance after adaptation. It therefore makes little sense to compare base models at face value.

Section: PROGRESS IN MODEL CAPABILITIES
Training on the test task leads to a substantial overestimation of the progress in benchmark performance per unit of compute achieved by recent model families. In Figure 7 we plot the Pareto frontier of benchmark accuracy against pretraining compute, both for models trained before November 2023 and for all models. We measure progress by considering the area of improvement of the Pareto frontier since November 2023, shaded in green. Without adjustment, the difference between the two Pareto frontiers is large, indicating very substantial progress since November 2023. After adjustment, however, the area of improvement shrinks sixfold, showing only modest improvements. Therefore, failing to account for training on the test task strongly overstates the progress in benchmark performance per unit of compute achieved by recent model families.
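The area-of-improvement measure described above could be computed along the following lines. This is a sketch under stated assumptions: the frontier is taken to be the best accuracy achieved at or below a given log-compute budget, the area is integrated over log-compute with a simple Riemann sum, and the function names and toy numbers are hypothetical rather than taken from Figure 7.

```python
import numpy as np


def pareto_frontier(log_c, acc, grid):
    """Best accuracy of any model with log-compute <= each grid value."""
    log_c = np.asarray(log_c, dtype=float)
    acc = np.asarray(acc, dtype=float)
    frontier = np.full(len(grid), np.nan)
    for i, g in enumerate(grid):
        reachable = log_c <= g
        if reachable.any():
            frontier[i] = acc[reachable].max()
    return frontier


def area_of_improvement(old_log_c, old_acc, new_log_c, new_acc, n_grid=2000):
    """Area between the all-models frontier and the older-models frontier.

    The grid starts at the smallest older model, so both frontiers are defined.
    """
    all_log_c = np.concatenate([old_log_c, new_log_c]).astype(float)
    all_acc = np.concatenate([old_acc, new_acc]).astype(float)
    grid = np.linspace(np.min(old_log_c), all_log_c.max(), n_grid)
    gap = pareto_frontier(all_log_c, all_acc, grid) - pareto_frontier(old_log_c, old_acc, grid)
    # Left Riemann sum over log-compute; the gap is non-negative by construction.
    return float(np.sum(gap[:-1] * np.diff(grid)))


# Toy numbers (made up): one newer model reaches 0.5 accuracy with 10x less compute.
area = area_of_improvement([20.0, 22.0], [0.3, 0.5], [21.0], [0.5])
print(f"area of improvement: {area:.3f}")
```

Comparing this quantity before and after the fine-tuning adjustment gives a single number summarizing how much of the apparent frontier shift survives.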
On the other hand, recent models tend to be trained on more data than is Chinchilla compute-optimal (Hoffmann et al., 2022). Given the Chinchilla scaling laws, it is noteworthy that newer, smaller "over-trained" models match the performance of older, larger ones for the same amount of pretraining compute. Since inference and fine-tuning of smaller models are substantially cheaper, recent models can be much more accessible to less well-resourced institutions, with little cost in performance. For example, we find that Llama 3 8B closely matches the performance of Llama 2 70B (both have similar pre-training compute).

Section: IMPLICATIONS FOR EMERGENCE
Throughout our evaluations, we observe emergent behavior for MMLU and GSM8K: models perform at near random chance up to a certain scale of pretraining compute, followed by relatively sharper improvements in performance at larger scales (Wei et al., 2022b). After training on the test task, however, emergence for MMLU and GSM8K appears to occur at substantially lower scales. We dedicate this section to more closely investigating the relationship between training on the test task and emergence.
Emergence arises at lower scales with increased test task training. We consider only models trained before November 2023, as we have established that these models train on the test task less than newer models. We evaluate the models at intermediate checkpoints as we fine-tune them on the datasets of task-relevant data introduced in Section 2.1. We fit α and $c_e$ in Equation 1 to the different intermediate checkpoints, and report in Figure 8 top the corresponding points of emergence $c_e$. We find that emergence arises at increasingly lower compute regimes as models train on the test task. For MMLU, models exhibit emergence at around $10^{22}$ FLOPs, the scale of Pythia 6.9B. After training on 64,000 examples, emergence arises around $6 \cdot 10^{20}$ FLOPs, the scale of Pythia 410M. We observe similar results for GSM8K, see Figure 19 in Appendix F.
Training on the test task yields increasingly better log-linear fits. The log-linear relationship between pretraining loss and compute is well-established (Kaplan et al., 2020). We observe that training on the test task increasingly recovers log-linear scaling between pretraining compute and benchmark accuracy. Similarly to the earlier section, we evaluate intermediate checkpoints but instead fit log-linear functions in Figure 8 bottom. We observe that the $R^2$ of the fit improves substantially as the models train on more task-relevant data, jumping from 0.63 to 0.95 after training on 64,000 examples. Therefore, after training on the test task, almost all the variation in benchmark accuracy is explained by log-linear scaling of pre-training compute. We observe similar results for GSM8K, see Figure 19 in Appendix F.
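The goodness-of-fit comparison described above can be illustrated with a small sketch. It assumes an ordinary least-squares fit of accuracy against log-compute and the usual definition of $R^2$; the function name and the two synthetic accuracy curves are hypothetical and only mimic the qualitative shapes discussed in the text.

```python
import numpy as np


def loglinear_r2(log_c, acc):
    """R^2 of the least-squares fit acc ~ intercept + slope * log C."""
    log_c = np.asarray(log_c, dtype=float)
    acc = np.asarray(acc, dtype=float)
    slope, intercept = np.polyfit(log_c, acc, deg=1)
    residual = acc - (slope * log_c + intercept)
    return float(1.0 - np.sum(residual ** 2) / np.sum((acc - acc.mean()) ** 2))


log_c = np.linspace(19.0, 23.0, 9)  # log10 of pretraining FLOPs (synthetic)

# "Emergent" shape: random chance until log C = 22, then a sharp rise.
emergent_acc = 0.25 + 0.10 * np.maximum(0.0, log_c - 22.0)
# Log-linear shape: steady improvement across all scales.
linear_acc = 0.25 + 0.05 * (log_c - 19.0)

print(f"emergent curve R^2:   {loglinear_r2(log_c, emergent_acc):.2f}")
print(f"log-linear curve R^2: {loglinear_r2(log_c, linear_acc):.2f}")
```

A curve that stays flat and then rises sharply is poorly described by a single line, so its $R^2$ is visibly lower than that of a curve that improves steadily across scales.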
Recommendations. Schaeffer et al. (2024a) argue that emergence appears due to the choice of metric. To mitigate emergence, they suggest considering Brier score instead of accuracy. We observe, however, that emergence for MMLU does not disappear when using the Brier score, see Figure 5 right, nor that of ARC and HellaSwag when framed as multiple-choice questions, see Figure 18 in Appendix F. We discuss two practical solutions to obtain predictive scaling while maintaining accuracy as the evaluation metric.
For MMLU and multiple-choice benchmarks more broadly, cloze evaluations consistently yield smoother and more predictable scaling, even when using accuracy as the evaluation metric. Since the purpose of these benchmarks is knowledge-testing more so than testing multiple-choice answering ability, cloze evaluations are preferable insofar as predictive scaling at lower compute scales is an important consideration. This recommendation aligns with the concurrent work by Gu et al. (2024).
More broadly, if sufficient task-relevant data is available, then training on the test task can result in much more predictable scaling by shifting emergence to smaller compute scales. That is, one can study scaling laws in which models across scales are fine-tuned on the same, sufficient amount of task-relevant data before evaluation. Such scaling laws correspond to those of specialist models, which for some tasks (e.g., legal annotation; Dominguez-Olmedo et al., 2024) or purposes (e.g., safety) might be preferable to the scaling laws of generalist models.
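For reference, the Brier score mentioned in the recommendations above can be sketched as follows for multiple-choice questions. The sketch assumes the multi-class convention that sums squared errors over the answer options and averages over questions; other normalizations exist, and the source does not state which one was used. The function name is hypothetical.

```python
import numpy as np


def brier_score(probs, labels):
    """Mean over questions of the squared distance between the predicted
    probability vector and the one-hot encoding of the correct answer."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels, dtype=int)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


# Four-option questions whose correct answer is option 0 (toy inputs).
uniform = [[0.25, 0.25, 0.25, 0.25]]   # random-chance prediction
leaning = [[0.40, 0.20, 0.20, 0.20]]   # slightly favors the correct answer
certain = [[1.00, 0.00, 0.00, 0.00]]   # fully confident and correct

for name, p in [("uniform", uniform), ("leaning", leaning), ("certain", certain)]:
    print(name, brier_score(p, [0]))
```

Unlike accuracy, this score changes continuously as probability mass moves toward the correct option, which is why it has been proposed as a metric under which emergence is less abrupt.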

Section: DISCUSSION
The 1968 Olympics took place in Mexico City at the significant altitude of 2340 meters, higher than Australia's tallest peak. Runners who had trained at altitude in their home countries were better prepared to compete in Mexico City's conditions, as it turned out. But the hotly debated results of the Games did not lead the organizers to prohibit training at natural altitude. Instead, they let everyone do it, and athletes came to consider altitude training an excellent way to train.
The anecdote holds a lesson for the evaluation of large language models half a century later. Knowledge about the evaluation conditions necessarily influences training practices under competitive pressure. It may be a fool's errand to prohibit the practice. Instead, we propose to adjust for it by giving every model the same task-specific preparation before evaluation. We work from the assumption that training on the test task, in general, cannot be effectively detected, disallowed, or disincentivized. Detecting what training data a model has seen is a notoriously difficult problem; existing heuristics achieve partial success at best. Researchers routinely acknowledge the futility of fighting data contamination. Moreover, we anticipate that the ways to effectively train on the test task will only grow in scope and adoption.
Our work demonstrates that comparisons of different models are confounded by the choice of training data and training practices. Different model families vary in the degree to which they were, implicitly or explicitly, trained on various test tasks. A relatively small amount of task data can have a disproportionately large effect on benchmark performance. It therefore makes little sense to compare model performance at face value without accounting for how the training data relate to the test task.
Training on the test task also has profound implications for the study of emergent behavior. After training on the test task, model capabilities become predictable at smaller scales. This greatly reduces the unpredictability associated with emergence, notably without any change in the metric.
Despite the daunting challenges that training on the test task poses for the fair evaluation of language models, it's also its own best remedy. Giving each model the same sufficient task-specific finetuning harmonizes model comparisons and linearizes the relationship between model capabilities and pretraining compute. We hope that our work informs stronger evaluation standards that address central challenges in the current evaluation ecosystem.

Section: A RELATED WORK
Benchmarks have played a central role in both machine learning (Hardt & Recht, 2022) and natural language processing (Storks et al., 2019). Classically, benchmarks comprised both a test set and a reasonably large training set (Garofolo et al., 1993; LeCun et al., 1998; Sang & De Meulder, 2003; Koehn, 2005; Deng et al., 2009). Models were trained on the same training set, and then evaluated on the accompanying test set. The success of unsupervised language modelling (Peters et al., 2018; Kenton & Toutanova, 2019; Radford et al., 2019), however, has changed this paradigm. Firstly, present-day language models differ in their training data, which is not standardized but rather treated as a design choice (Raffel et al., 2020; Albalak et al., 2024; Li et al., 2024). Secondly, language models are a priori not trained with the explicit objective of maximizing any single benchmark.
Data contamination. Data contamination or test-set contamination refers to any overlap between the training and the test data such that test results overestimate a model's generalization performance. The scale and often limited curation of present-day pretraining corpora exacerbate data contamination concerns in language model evaluations (Jiang et al., 2024). Consequently, data contamination is usually discussed in the technical reports accompanying model releases (Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2023; Touvron et al., 2023b). However, detecting and preventing data contamination is currently an open problem (Gunasekar et al., 2023; Yang et al., 2023b; Golchin & Surdeanu, 2023). Roberts et al. (2023) and Li & Flanigan (2024) find that models often perform better on datasets that were publicly available during model training. While almost all models that we consider were released after MMLU and GSM8K, we nonetheless find that, controlling for compute, more recent models perform better. These performance gains are unlikely to be driven solely by test set leakage and require additional explanation. In Section 3.2, we find evidence that training on the test task may be a more dominant factor in benchmark performance than data contamination. A key insight of our work is that models can train on the test task (e.g., multiple-choice question answering) and do much better at a given benchmark (e.g., MMLU) without necessarily seeing any benchmark data. This defies traditional notions of data contamination, which strictly refer to the leakage of benchmark data (Magar & Schwartz, 2022; Sainz et al., 2023; Dong et al., 2024). Moreover, whereas the core concern with data leakage is that benchmarks may inadvertently find their way to the training data, training on the test task is often a deliberate design choice (e.g., pre-training on instruction data).
Adaptation prior to evaluation. In the 2010s, language models had to be adapted to different benchmarks using supervised task data (Collobert et al., 2011; Dai & Le, 2015; Devlin et al., 2019). The purpose of such fine-tuned benchmark models was solely to facilitate the relative comparison of base models. Importantly, all models were adapted using the same supervised data (Collobert et al., 2011). With GPT-3 (Brown et al., 2020), few-shot prompting emerged as the dominant paradigm for adapting models to a particular task prior to evaluation (Liang et al., 2023), arguably due to its simplicity relative to fine-tuning. Bommasani et al. (2021) argue that benchmark evaluations should account for the adaptation resources used by each model (e.g., the adaptation data). Similarly, Liang et al. (2023) argue that the strategy for adapting the models to the benchmark evaluation should be controlled for. Clearly, few-shot prompting is a weaker form of adaptation than training on hundreds of thousands of task examples, as newer models often do. Our proposal of fine-tuning on the same, sufficient amount of task-specific data prior to evaluation aims to effectively control for model adaptation, ensuring that all models are given equal adaptation resources.
Training on the test task. The effectiveness of fine-tuning on the training set accompanying LLM benchmarks is well-known (Wei et al., 2022a; Wang et al., 2022; Chung et al., 2024). Consequently, many influential instruction-tuning datasets contain or are partly derived from benchmark train data (Wei et al., 2022a; Honovich et al., 2022; Mukherjee et al., 2023). Li & Flanigan (2024) identify small amounts of benchmark-specific data in the publicly available Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023) instruction-tuning sets. Zhou et al. (2023b) empirically analyze the effects of fine-tuning on benchmark-specific data and warn about its impacts on benchmark validity. To circumvent these issues, recent work has focused on indirect indicators of broader data contamination, such as a lack of robustness to task transformations (Wu et al., 2023), or underperformance on benchmarks with novel task combinations (Yu et al., 2023a). In contrast, we find evidence for training on the test task without the need for explicitly identifying specific data points used at training time, or modifying tasks. In addition, our proposed method of fine-tuning on task data before evaluation allows us to quantify and correct for the effects of training on the test task on benchmark performance.
Emergent abilities of language models. Emergent capabilities (Wei et al., 2022b; Ganguli et al., 2022) refer to levels of model performance at large scales that cannot be easily predicted by extrapolating from smaller scales. Wei et al. (2022b) report emergent capabilities for various benchmarks including MMLU and GSM8K (Srivastava et al., 2022). However, Srivastava et al. (2022) and Schaeffer et al. (2024b) find that the log-probability of the correct answer often improves smoothly, even when other metrics seem to show emergence. Rogers & Luccioni (2024) question the dominant definition of emergence and emphasize the importance of relating the training data to the test data before making claims about emergence. Lu et al. (2023) argue that most emergent capabilities can be explained by in-context learning. Gadre et al. (2024) find that a model's perplexity on its pre-training data reliably predicts its average downstream performance. However, their analysis does not include the MMLU and GSM8K benchmarks, which are central to our work. Schaeffer et al. (2024a) argue that emergent capabilities are mostly an artifact of non-linear and discontinuous evaluation metrics like accuracy. In contrast, we find signs of emergence on MMLU even when using continuous metrics like the Brier score. Similarly to our findings, Snell et al. (2024) show that increasingly fine-tuning on the test task shifts the point of emergence to smaller compute scales.

Section: B ADDITIONAL EXPERIMENTAL DETAILS B.1 MODELS CONSIDERED
Model size in billions of parameters is indicated by N and pretraining data size in trillions of tokens is indicated by D. Model weights were retrieved from the corresponding HuggingFace (HF) repositories.
Figure 9: [...] The left plot shows MMLU performance after fine-tuning with the original learning rates detailed in Appendix B.2. The right plot shows MMLU performance after fine-tuning, where newer models use the learning rate from the sweep that maximizes their MMLU performance. Even with the advantage of a higher hyperparameter search budget for the newer models, the estimated effect size θ of model recency on benchmark performance remains both small and not statistically significant.

Section: B.2
We perform minimal hyperparameter tuning and use standard hyperparameter choices throughout. We use a learning rate of $2 \cdot 10^{-5}$ for models with fewer than 10B parameters and a learning rate of $2 \cdot 10^{-6}$ for models with more than 10B parameters. For Qwen 2 as well as four of the 7B models (Gemma 7B, Olmo 7B, Olmo 1.7 7B, and Llama 3 8B), benchmark accuracy degraded after fine-tuning. For these models, we use a peak learning rate of $2 \cdot 10^{-6}$ instead. We use a cosine learning rate schedule with linear warm-up for 50 steps and decay to 10% of the peak learning rate. We use AdamW (Loshchilov & Hutter, 2018) as the optimizer, with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\epsilon = 10^{-8}$. We fine-tune with batch size 64. We use a weight decay rate of 0.1 and clip gradients at 1.0. We verify that the training loss decreases for all models on both of the fine-tuning datasets. To reduce the computational burden of fine-tuning, we train with context size 600. We verify that less than 5% of the fine-tuning examples have context length above 600.
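As an illustration of the schedule described above, the following minimal sketch computes the learning rate at a given step for linear warm-up over 50 steps followed by cosine decay to 10% of the peak. The function name `lr_at_step`, the argument `total_steps`, and the exact warm-up convention (reaching the peak on the last warm-up step) are assumptions made for this example; they are not taken from our training code.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=50, final_ratio=0.1):
    """Linear warm-up, then cosine decay from peak_lr to final_ratio * peak_lr.

    `step` is 0-indexed. `total_steps` is a hypothetical training length.
    """
    if step < warmup_steps:
        # Assumed convention: ramp linearly and hit the peak on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase that has been completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine
```

With `peak_lr=2e-5`, the schedule ends at `2e-6`; models that use the lower peak of `2e-6` would pass that value as `peak_lr` instead.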
We use an internal cluster of A100 and H100 GPUs. Fine-tuning all models required approximately 10,000 H100 GPU hours, whereas evaluating all models on the different benchmarks required approximately 400 H100 GPU hours.

Section: B.3 ROBUSTNESS CHECK ON THE HYPERPARAMETER SEARCH BUDGET
We found the learning rate to be, by a large margin, the single most impactful hyperparameter. We perform a sweep on the MMLU auxiliary training set with the following learning rates: $6 \cdot 10^{-5}$, $2 \cdot 10^{-5}$, $6 \cdot 10^{-6}$, $2 \cdot 10^{-6}$, and $6 \cdot 10^{-7}$. We are unable to perform more extensive hyperparameter sweeps due to their large computational cost. We find that no models benefit from a smaller learning rate than $2 \cdot 10^{-6}$, and only one model benefits from a larger learning rate than $2 \cdot 10^{-5}$, Pythia 70M (the model with smallest pre-training compute). Thus, for every model except for Pythia 70M, the optimal learning rate is inside the boundary of the hyperparameter sweep.
A potential concern is that older models may appear to match the performance of newer models because the fine-tuning hyperparameters could be systematically more favorable to the older models. To address this concern, we recreate our main experiment by fine-tuning older models with their original learning rates, while selecting for the newer models the learning rate in the sweep that leads to the highest MMLU performance. That is, we deliberately give newer models a systematic advantage in terms of hyperparameter search budget. We plot the results in Figure 9. The estimated effect size of model recency on benchmark performance shifts from θ = -0.005 to θ = 0.005. Therefore, despite giving newer models a systematic advantage in terms of hyperparameter search budget, the effect size remains both small and not statistically significant.

Section: C CAUSAL INTERPRETATION OF OUR FINDINGS
In Section 2.2 we established that models trained after November 2023 significantly outperform those trained before November 2023 for both MMLU and GSM8K. We then showed that fine-tuning all models on the test task equalizes the performance of newer and older models. We now present a causal interpretation of our findings. The key obstacle to our analysis is that test task training T is unobservable, for two reasons. Firstly, practitioners are typically not transparent about their design choices, including the pretraining data. Secondly, the extent to which different training practices might amount to test task training is unclear. Nonetheless, by fine-tuning on task-specific data, we can intervene on the extent to which models train on the test task.
Figure 10 summarizes our causal assumption. The time at which a model was trained determines the design choices made, such as its pretraining data or pretraining compute C. These design choices in turn affect how much the model trains on the test task. All these factors ultimately influence the pretrained model and thus its benchmark performance. We also admit that compute might influence test task training. For instance, pre-training on larger datasets may lead to models training more on the test task.
We interpret the proposed adjustment method as intervening on the test task training variable T, namely by fine-tuning all models on the same amount of task-specific data before evaluation. The external validity of our subsequent analysis hinges on the assumption that our controlled experimental setting (fine-tuning models after the pretraining stage) is reasonably similar to the natural settings in which practitioners might train on the test task during pretraining (e.g., by including instruction data in the pretraining data mixture). We provide evidence for this in Appendix D.3.
We model fine-tuning as a hard intervention do(T = t) (Pearl, 2009). The specific magnitude of the intervention t need not be quantified. Instead, the key assumption is that by fine-tuning on the same, sufficient amount of task data, all models will have received the same amount of test task training. Since some models may have already trained on the test task before fine-tuning, this assumption only holds if test task training saturates, and we train on enough task data to reach saturation. The fact that our task-specific datasets allow older models to match the performance of newer models provides some evidence that we train on enough task-specific data to reach saturation.
We draw inspiration from scaling laws (Kaplan et al., 2020) and model the relationship between pretraining compute and its causal descendants as piecewise log-linear:
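$$f(C, \alpha) = \alpha_0 + \sum_{i=1}^{|\alpha|} \alpha_i \log C \cdot [C > c_i] \tag{2}$$
For simplicity, we consider three fixed knots at $c_1 = 0$, $c_2 = 10^{22}$, and $c_3 = 10^{23}$ FLOPs. We assume all other variable relationships to be linear, resulting in the structural assignments:
$$T := f(C, \beta) + \phi N + \delta, \qquad \delta \sim \mathcal{N}(0, \sigma_\delta^2) \tag{3}$$
$$A := f(C, \alpha) + \psi N + \gamma T + \eta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2) \tag{4}$$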
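To make the piecewise form of Equation 2 concrete, the following sketch evaluates f for a given compute budget with the three knots stated above. The tuple layout of `alpha` (intercept first, then one slope per knot) and the use of the natural logarithm are assumptions of this example.

```python
import math

# Knots c1, c2, c3 in FLOPs, as stated in the text.
KNOTS = (0.0, 1e22, 1e23)


def f(C, alpha, knots=KNOTS):
    """Piecewise log-linear function of pretraining compute C (Equation 2).

    alpha = (alpha_0, alpha_1, ..., alpha_k): an intercept followed by one
    coefficient per knot. The term alpha_i * log(C) is active only if C > c_i.
    """
    intercept, slopes = alpha[0], alpha[1:]
    if len(slopes) != len(knots):
        raise ValueError("expected one slope coefficient per knot")
    log_c = math.log(C)  # assumption: natural logarithm
    return intercept + sum(a * log_c for a, c in zip(slopes, knots) if C > c)
```

For a model with C below $10^{22}$ FLOPs only the first slope is active, whereas for C above $10^{23}$ FLOPs all three slopes contribute.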
We denote benchmark accuracy after fine-tuning as $A|_{do(T=t)}$. To estimate the direct effect N → A of model recency on accuracy, we regress the linear model
$$A|_{do(T=t)} = f(C, \alpha) + \psi N + \gamma t + \eta + \epsilon = f(C, \alpha) + \psi N + \eta' + \epsilon, \qquad \eta' = \eta + \gamma t \tag{5}$$
where $\alpha$, $\psi$, $\eta'$ are the fit's parameters and $\epsilon$ is random noise. The coefficient $\psi$ corresponds to the direct effect N → A of model recency on benchmark accuracy. We additionally regress on the difference in accuracy before and after the intervention:
$$\begin{aligned} A - A|_{do(T=t)} &= \left(f(C, \alpha) + \psi N + \gamma T + \eta + \epsilon_1\right) - \left(f(C, \alpha) + \psi N + \gamma t + \eta + \epsilon_2\right) \\ &= \gamma T - \gamma t + \epsilon_1 - \epsilon_2 \\ &= f(C, \gamma\beta) + \gamma\phi N + \gamma\delta - \gamma t + \epsilon_1 - \epsilon_2 \\ &= f(C, \beta') + \phi' N + b + \epsilon', \quad \text{for } \beta' = \gamma\beta,\ \phi' = \gamma\phi,\ b = -\gamma t,\ \epsilon' = \epsilon_1 - \epsilon_2 + \gamma\delta \end{aligned} \tag{6}$$
where $\beta'$, $\phi'$, $b$ are the fit's parameters and $\epsilon'$ is random noise. The coefficient $\phi'$ corresponds to the indirect effect N → T → A of model recency N on benchmark accuracy A mediated by test task training T (Pearl, 2013). That is, the improvements in accuracy of recent models attributable to training on the test task.
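The decomposition in Equation 6 can be checked numerically with a small noise-free simulation of the structural assignments in Equations 3 and 4. This is only a sketch: all parameter values are hypothetical, and f is simplified to a single log-linear piece, which is an assumption of the example rather than the fitted model.

```python
import math


def f(C, coef):
    """Simplified stand-in for Equation 2: a single log-linear piece."""
    return coef[0] + coef[1] * math.log(C)


def task_training(C, N, p, delta=0.0):
    """Equation 3: T := f(C, beta) + phi * N + delta."""
    return f(C, p["beta"]) + p["phi"] * N + delta


def accuracy(C, N, T, p, eps=0.0):
    """Equation 4: A := f(C, alpha) + psi * N + gamma * T + eta + eps."""
    return f(C, p["alpha"]) + p["psi"] * N + p["gamma"] * T + p["eta"] + eps


def accuracy_gap(C, N, t, p):
    """A - A|do(T=t) for one model, with all noise terms set to zero."""
    T = task_training(C, N, p)         # natural level of test task training
    observed = accuracy(C, N, T, p)    # accuracy before fine-tuning
    intervened = accuracy(C, N, t, p)  # accuracy under do(T = t)
    return observed - intervened


# Hypothetical parameter values, chosen only to make the arithmetic visible.
PARAMS = {"alpha": (0.1, 0.01), "beta": (0.0, 0.02), "psi": 0.0,
          "phi": 0.5, "gamma": 0.2, "eta": 0.05}
T_INTERVENTION = 2.0  # hypothetical intervention level t

gap_newer = accuracy_gap(1e22, 1, T_INTERVENTION, PARAMS)  # N = 1
gap_older = accuracy_gap(1e22, 0, T_INTERVENTION, PARAMS)  # N = 0
# At equal compute, the gaps differ by gamma * phi, i.e. phi' in Equation 6.
indirect_effect = gap_newer - gap_older
```

At equal compute, the terms in C, the intercept, and $\gamma t$ cancel, so the difference between the two gaps isolates $\phi' = \gamma\phi$ regardless of the intervention level chosen.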
We fit the models in Equation 5 and Equation 6, and we report the coefficients pertaining to N → A and N → T → A in Table 4 and Table 3. We find that the indirect effect N → T → A of model recency on accuracy mediated by test task training T is significant, positive, and large. In contrast, we find no evidence of a significant direct effect N → A of model recency on accuracy. We therefore find no evidence that the improvements of newer models are attributable to anything other than training on the test task.
In conclusion, our causal analysis indicates that the differences in MMLU and GSM8K performance between newer and older models observed in Section 2.1 are largely attributable to differences in test task training. That is, the mechanism by which newer models outperform older models is primarily by training more on the test task.

Section: D ROBUSTNESS CHECK ON THE TEMPORAL SPLIT D.1 ADJUSTING THE TEMPORAL CUTOFF BY A FEW MONTHS
We repeat the analysis of Section 2 for two additional temporal splits: September 2023 and January 2024, and present the results in Figure 11 and Figure 12, respectively. Our results are robust to shifting the temporal cutoff by a few months. That is, our findings indicate that practitioners started adopting design choices around late 2023 that amount to models training on the test task much more, which is consistent with models' technical reports starting to mention the use of benchmark or instruction data at pre-training time. Choosing specifically the month of November as the cut-off is therefore not critical for our analysis.

Section: D.2 EN VS CN LANGUAGE DATA
Instead of dividing models using a temporal split, we divide models based on whether they were trained primarily on English (EN) data or on a mixture of English and Chinese (EN+CN) language data. While there is a considerable overlap between the temporal split and the EN/EN+CN model split, there are notable differences.
In particular, the Baichuan, Baichuan 2, InternLM, and Skywork families were trained before November 2023 on EN+CN data. Conversely, Gemma, Llama 3, StableLM 2, Falcon 2, and Olmo were trained after November 2023 on EN data.
We repeat the analysis of Section 2 for the EN and EN+CN model split (see Figure 13). We observe that, controlling for pretraining compute, models trained on EN+CN language data outperform those trained primarily on EN by 9 accuracy points on MMLU and 12 accuracy points on GSM8K. After the proposed adjustment, however, the difference in performance between models trained on EN data and EN+CN data is small and not statistically significant. The confounding and measured effect sizes for the EN and EN+CN model split resemble those obtained for the temporal split, which we interpret as a valuable robustness check of our results.

Section: D.3 HOW SIMILAR ARE NEWER MODELS TO OLDER, FINE-TUNED MODELS?
In Section 3.1 we fine-tune older models on the test task, and we demonstrate that the differences in benchmark performance between the fine-tuned and non fine-tuned models resemble those between newer and older models. In this section we provide further evidence that newer models resemble older, fine-tuned models.
We take the older models and we fine-tune them with 64,000 training examples from the auxiliary training sets introduced in Section 2.1. We plot in Figure 14 the benchmark scores of the older, fine-tuned models as well as that of the newer models. We qualitatively observe that both the older, fine-tuned models and the newer models exhibit similar scaling. That is, older fine-tuned models resemble newer models in terms of performance per compute.
We perform a quantitative analysis that consists of discriminating between the older models and the newer models based on their pretraining compute and benchmark accuracy. That is, we construct a tabular dataset where rows are models and columns are their corresponding pretraining compute, benchmark accuracy, and whether the model was trained after November 2023. We then train a classifier aiming to predict model recency from compute and accuracy. Intuitively, if the performance of older models is very different from that of newer models, then we would obtain high prediction accuracy (i.e., the two classes are highly separable). Note that prediction accuracy provides a lower bound on the total variation (TV) distance between the distributions of compute and accuracy of older and newer models.
We train XGBoost classifiers and report balanced accuracy for leave-one-out cross-validation in Table 5. We obtain close to random-chance accuracy in discriminating between older, fine-tuned models and newer models. That is, older fine-tuned models are indistinguishable from newer models in terms of their performance per pre-training compute.
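The evaluation protocol can be sketched as follows. For brevity, and to keep the example dependency-free, this sketch swaps the XGBoost classifier for a 1-nearest-neighbour rule; the function names are hypothetical and any data passed in would be made up, not our measurements. The helper `tv_lower_bound` uses the standard population-level fact that a classifier with balanced accuracy b implies a TV distance of at least 2b - 1 between the two class-conditional distributions.

```python
def loo_balanced_accuracy(points, labels):
    """Leave-one-out balanced accuracy of a 1-nearest-neighbour classifier.

    points: list of feature tuples, e.g. (log compute, benchmark accuracy).
    labels: list of 0/1 flags (1 = trained after the temporal cutoff).
    Assumes both classes are present.
    """
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for i, (x, y) in enumerate(zip(points, labels)):
        # Nearest neighbour among all *other* points (squared Euclidean distance).
        j = min(
            (k for k in range(len(points)) if k != i),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, points[k])),
        )
        total[y] += 1
        correct[y] += int(labels[j] == y)
    # Balanced accuracy: mean of the per-class recalls.
    return 0.5 * (correct[0] / total[0] + correct[1] / total[1])


def tv_lower_bound(balanced_accuracy):
    """Lower bound on TV distance implied by a balanced accuracy."""
    return max(0.0, 2.0 * balanced_accuracy - 1.0)
```

A balanced accuracy near 0.5 yields a bound near zero, i.e., the classifier finds no evidence that the two groups of models differ in their performance per compute.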

Section: E RESULTS FOR THE OPENLLM LEADERBOARD V2
HuggingFace released in June 2024 a revision of the OpenLLM Leaderboard (Fourrier et al., 2024a). The HF leaderboard v2 differs from v1 in the six benchmarks it considers: MMLU Pro (Wang et al., 2024), GPQA (Rein et al., 2023), BBH (Suzgun et al., 2023), MuSR (Sprague et al., 2023), the Level 5 subset of MATH (Hendrycks et al., 2021), and IFEval (Zhou et al., 2023a). MMLU Pro and GPQA test for knowledge and are framed as multiple-choice questions. BBH and MuSR test for reasoning. MATH tests for mathematical reasoning. IFEval tests the ability of models to follow instructions.
The creators of the OpenLLM Leaderboard cite contamination as a key motivation for releasing the v2 revision. They note that a key criterion in choosing the benchmarks of the HF leaderboard v2 was lack of contamination in models as of today. In particular, Fourrier et al. (2024b) claim that current models are not contaminated for GPQA, MuSR, and MMLU Pro: GPQA due to the gating of the test set, and MuSR and MMLU Pro due to their "youth". Fourrier et al. (2024b) succinctly express their concern regarding data contamination in the HF leaderboard v1:
"Some newer models also showed signs of contamination. By this, we mean that models were possibly trained on benchmark data or on data very similar to benchmark data. As such, some scores stopped reflecting the general performance of the model and started to overfit on some evaluation datasets instead of reflecting the more general performance of the task being tested. This was, in particular, the case for GSM8K and TruthfulQA, which were included in some instruction fine-tuning sets."
Note that "models were possibly trained on benchmark data or on data very similar to benchmark data" encompasses not only test set contamination but more broadly training on the test task.
We evaluate all models on MMLU Pro, GPQA, BBH, MuSR and MATH Lvl 5. We use the LM Evaluation Harness library in an identical fashion to the HF leaderboard v2. We do not evaluate on IFEval since it tests for instruction following and we evaluate base models. We additionally evaluate the models that we fine-tuned in Section 2.1 for multiple choice question answering and mathematical reasoning. This gives us models' adjusted benchmark scores after training on multiple choice question answering and mathematical reasoning. For MATH Lvl 5, we use the models fine-tuned on mathematical data, whereas for MMLU Pro, GPQA, BBH and MuSR we use the models fine-tuned on multiple choice question answering. The fine-tuning datasets were not adapted to the new benchmarks in the HF leaderboard v2, thus giving a valuable insight into how well these task-relevant datasets generalize beyond MMLU and GSM8K.
We plot in Figure 15 models' benchmark scores pre- and post-adjustment. We find that newer models significantly outperform older ones in all five benchmarks after controlling for pretraining compute. The differences in performance are smaller in absolute terms than those measured for MMLU (0.073) and GSM8K (0.191). This is in part because these benchmarks are "harder", which also means smaller differences in performance between the best and worst model. For this reason, we also report the difference between newer and older models relative to the difference between the best and worst model. This relative difference is 13.7% for MMLU Pro, 14.5% for GPQA, 12.1% for MuSR, 9.7% for BBH, and 10.0% for MATH Lvl 5, compared to 15.3% for MMLU and 25.0% for GSM8K. Therefore, newer models overperform in MMLU Pro, GPQA and MuSR about as much as they do for MMLU, and somewhat less for BBH and MATH Lvl 5.
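The relative difference is simply the estimated gap divided by the spread between the best and worst model. The short sketch below spells out this arithmetic and inverts it to recover the spread implied by the reported numbers; the function names are hypothetical, and the implied spreads are only approximate because the reported figures are rounded.

```python
def relative_gap(theta, best, worst):
    """Gap between newer and older models relative to the best-worst spread."""
    return theta / (best - worst)


def implied_spread(theta, relative):
    """Best-worst spread implied by an absolute gap and its relative value."""
    return theta / relative


# Absolute gaps and relative differences reported in the text.
mmlu_spread = implied_spread(0.073, 0.153)   # roughly 0.48 accuracy
gsm8k_spread = implied_spread(0.191, 0.250)  # roughly 0.76 accuracy
```

Under this reading, a given absolute gap counts for more on a benchmark where the spread between the best and worst model is narrow.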
Fine-tuning on task-relevant data reduces the difference in performance between newer and older models for all five benchmarks. Therefore, we find evidence that training on the test task plays a substantial role in newer models outperforming older ones in the benchmarks of the HF Leaderboard v2. For GPQA and MuSR, the difference in performance after adjustment is very small (|θ| ≤ 0.002) and not statistically significant. For BBH, the estimated difference in performance θ reduces by 40% to 0.015 and is no longer statistically significant. For MMLU Pro and MATH Lvl 5 the difference reduces by 19% and 33% respectively but remains reasonably large (θ > 0.01).
Figure 15: Results for the OpenLLM Leaderboard v2. For all benchmarks, models trained after November 2023 significantly outperform models trained before November 2023 when controlling for pretraining compute. After fine-tuning models on multiple choice question answering and mathematical reasoning, differences in performance between newer and older models reduce for all five benchmarks. These differences are no longer significant for GPQA, MuSR and BBH, but remain significant for MMLU Pro and MATH Lvl 5.
One possible reason for the fact that the adjustment for MMLU Pro and MATH Lvl 5 is not as effective as for MMLU and GSM8K is that the fine-tuning examples are simply not as relevant for MMLU Pro and MATH Lvl 5. For example, the questions and answers in MATH Lvl 5 contain much more LaTeX equation formatting than our mathematical reasoning fine-tuning dataset. Similarly, our multiple choice fine-tuning dataset contains mostly questions with 4 answer choices, whereas all MMLU Pro questions have 10 answer choices. Thus, models are primarily fine-tuned to answer "A", "B", "C", and "D" but not "E", "F", "G". We modify MMLU Pro to contain questions with 4 answer choices by randomly discarding 6 of the incorrect answer choices. We evaluate models pre- and post-adjustment and plot the results in Figure 16. We observe that the difference in performance between newer and older models after adjustment reduces from 0.024 to 0.016, and is no longer statistically significant. This observation suggests that fine-tuning on more relevant task data might further reduce the gap between newer and older models in MMLU Pro and MATH Lvl 5.
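The reduction of a 10-choice question to 4 choices can be sketched as follows. The function name is hypothetical, and keeping the surviving options in their original relative order is an assumption of this example; the text above only specifies that 6 incorrect choices are discarded at random.

```python
import random


def reduce_to_four_choices(options, answer_index, rng, keep=4):
    """Keep the correct option plus (keep - 1) randomly chosen incorrect ones.

    options: list of answer strings; answer_index: position of the correct one.
    rng: a random.Random instance, so that the sampling is reproducible.
    Returns the reduced option list and the new index of the correct answer.
    """
    incorrect = [i for i in range(len(options)) if i != answer_index]
    # rng.sample raises ValueError if there are fewer than keep - 1 distractors.
    kept = sorted(rng.sample(incorrect, keep - 1) + [answer_index])
    new_options = [options[i] for i in kept]
    return new_options, kept.index(answer_index)
```

For a question with 10 options, `reduce_to_four_choices(options, answer_index, random.Random(0))` returns 4 options that always include the correct answer, so the reduced item can be labeled "A" to "D".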
Figure 16: We modify MMLU Pro to only contain questions with 4 answer choices by randomly discarding, for every question, 6 of the incorrect answer choices. After adjustment, the difference in performance θ between newer and older models is smaller and no longer statistically significant.
Discussion. Fourrier et al. (2024b) cite newer models overperforming in the HF leaderboard v1 due to being "possibly trained on benchmark data or on data very similar to benchmark data" as a major reason for the HF leaderboard v2 revision. We, however, find evidence that training on the test task is also a confounder for the newly included benchmarks. Specifically, the difference in performance between newer and older models is significant for MMLU Pro, GPQA, MuSR, BBH and MATH Lvl 5, and these differences reduce after adjusting by fine-tuning on the test task. Fourrier et al. (2024b) explicitly highlight GPQA and MuSR as benchmarks likely unaffected by contamination, the former due to being gated and the latter due to its "youth". Not only do newer models significantly outperform older ones in GPQA and MuSR, but these differences in performance fully vanish after fine-tuning on the test task. That is, newer models likely overperform in GPQA and MuSR precisely due to training on the test task.
These findings highlight that training on the test task is a distinct phenomenon from test set leakage. Strategies that aim to mitigate data contamination (e.g., dynamic benchmarks) might not be effective in mitigating the confounding effect of training on the test task. In contrast, we extensively demonstrated the effectiveness of our proposed adjustment procedure, that is, fine-tuning on sufficient task-relevant data before evaluation.

Section: F ADDITIONAL FIGURES


Section: MMLU rankings
Training on the test task significantly alters model rankings on MMLU, with an average shift of 4.8 ranks and a maximum shift of 16 ranks.
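The rank-shift statistics can be computed as in the sketch below, where each model is ranked by its score before and after adjustment. The function name and the tie-breaking rule (ties keep dictionary insertion order) are assumptions of this example.

```python
def rank_shifts(before, after):
    """Absolute change in rank per model between two score tables.

    before, after: dicts mapping model name -> benchmark score.
    Rank 1 is the highest score; ties keep dictionary insertion order.
    """
    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {model: rank for rank, model in enumerate(ordered, start=1)}

    rank_before, rank_after = ranks(before), ranks(after)
    return {model: abs(rank_before[model] - rank_after[model]) for model in before}


def summarize(shifts):
    """Average and maximum rank shift across models."""
    values = list(shifts.values())
    return sum(values) / len(values), max(values)
```

Passing the pre-adjustment and post-adjustment MMLU scores of all models to `rank_shifts` and then to `summarize` yields the average and maximum shift.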
Reformulating ARC and HellaSwag as multiple choice. In Figure 18 we show that ARC and HellaSwag do not exhibit emergence when using the standard cloze evaluation. When reformulating the task as multiple choice in the style of MMLU, however, we observe emergence around $10^{22}$ to $10^{23}$ FLOPs, similarly to MMLU. Emergence in this range of compute persists even when changing the evaluation metric from accuracy to the Brier score (a continuous metric), as suggested by Schaeffer et al. (2024a).
Emergence for GSM8K as models train on the test task. Similar to MMLU, we find that increasingly fine-tuning models on mathematical reasoning makes the phenomenon of emergence gradually disappear, see Figure 19. The point of emergence arises at increasingly lower scales, recovering cleaner log-linear fits.

Section: G RESULTS FOR INSTRUCTION-TUNED AND CHAT MODELS
We evaluated 36 instruct and chat models, see Appendix B.1.1. Our findings for base models presented in the main text generalize remarkably well to instruction-tuned and chat models, see Figure 20. Newer instruct/chat models substantially outperform older instruct/chat models. However, after fine-tuning all models on the same amount of task-specific data, performance between newer and older instruct/chat models equalizes.
We find that the performance gap between newer and older instruct/chat models is smaller than the gap between newer and older base models (contrast the estimated effect sizes in Figure 1 with those in Figure 20). This is perhaps to be expected, as instruction-tuning datasets usually include some amount of benchmark data.
We posit that the gap between newer and older instruct/chat models is nonetheless large because early-to-mid 2023 instruction-tuned variants, e.g., Vicuna (Chiang et al., 2023), Alpaca (Taori et al., 2023), and Llama 2 Chat (Touvron et al., 2023b), did not emphasize benchmark performance but rather human preference (e.g., "win-rate") in a chat setting. For example, the Llama 2 technical report (Touvron et al., 2023b) includes no benchmark evaluations for their chat models. This perspective has dramatically shifted in the last year and a half, and post-training interventions now explicitly aim to improve benchmark performance (MetaAI, 2024; Gemma et al., 2024; Lambert et al., 2024).


References:
[b0] Alon Albalak; Yanai Elazar; Sang Michael Xie; Shayne Longpre; Nathan Lambert; Xinyi Wang; Niklas Muennighoff; Bairu Hou; Liangming Pan; Haewon Jeong (2024). A survey on data selection for language models. 
[b1] Ebtesam Almazrouei; Hamza Alobeidli; Abdulaziz Alshamsi; Alessandro Cappelli; Ruxandra Cojocaru; Mérouane Debbah; Étienne Goffinet; Daniel Hesslow; Julien Launay; Quentin Malartic (2023). The falcon series of open language models. 
[b2] Jinze Bai; Shuai Bai; Yunfei Chu; Zeyu Cui; Kai Dang; Xiaodong Deng; Yang Fan; Wenbin Ge; Yu Han; Fei Huang (2023). . 
[b3] Edward Beeching; Clémentine Fourrier; Nathan Habib; Sheon Han; Nathan Lambert; Nazneen Rajani; Omar Sanseviero; Lewis Tunstall; Thomas Wolf (2023). Open LLM leaderboard. Hugging Face. 
[b4] Marco Bellagente; Jonathan Tow; Dakota Mahan; Duy Phung; Maksym Zhuravinskyi; Reshinth Adithyan; James Baicoianu; Ben Brooks; Nathan Cooper; Ashish Datta (2024). Stable lm 2 1.6 b technical report. 
[b5] Stella Biderman; Hailey Schoelkopf; Quentin Gregory Anthony; Herbie Bradley; Kyle O'Brien; Eric Hallahan; Mohammad Aflah Khan; Shivanshu Purohit; USVSN Sai Prashanth; Edward Raff (2023). Pythia: A suite for analyzing large language models across training and scaling. PMLR
[b6] Rishi Bommasani; Drew A Hudson; Ehsan Adeli; Russ Altman; Simran Arora; Sydney von Arx; Michael S Bernstein; Jeannette Bohg; Antoine Bosselut; Emma Brunskill (2021). On the opportunities and risks of foundation models. 
[b7] Tom Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared D Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell (2020). Language models are few-shot learners. Advances in neural information processing systems
[b8] Zheng Cai; Maosong Cao; Haojiong Chen; Kai Chen; Keyu Chen; Xin Chen; Xun Chen; Zehui Chen; Zhi Chen; Pei Chu (2024). Internlm2 technical report. 
[b9] Wei-Lin Chiang; Zhuohan Li; Zi Lin; Ying Sheng; Zhanghao Wu; Hao Zhang; Lianmin Zheng; Siyuan Zhuang; Yonghao Zhuang; Joseph E Gonzalez (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. 
[b10] Aakanksha Chowdhery; Sharan Narang; Jacob Devlin; Maarten Bosma; Gaurav Mishra; Adam Roberts; Paul Barham; Hyung Won Chung; Charles Sutton; Sebastian Gehrmann (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research
[b11] Hyung Won Chung; Le Hou; Shayne Longpre; Barret Zoph; Yi Tay; William Fedus; Yunxuan Li; Xuezhi Wang; Mostafa Dehghani; Siddhartha Brahma (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research
[b12] Peter Clark; Isaac Cowhey; Oren Etzioni; Tushar Khot; Ashish Sabharwal; Carissa Schoenick; Oyvind Tafjord (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. 
[b13] Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano (2021). Training verifiers to solve math word problems. 
[b14] Ronan Collobert; Jason Weston; Léon Bottou; Michael Karlen; Koray Kavukcuoglu; Pavel Kuksa (2011). Natural language processing (almost) from scratch. Journal of machine learning research
[b15] Andrew M Dai; Quoc V Le (2015). Semi-supervised sequence learning. Advances in neural information processing systems
[b16] Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Li Fei-Fei (2009). Imagenet: A large-scale hierarchical image database. 
[b17] Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. 
[b18] Ricardo Dominguez-Olmedo; Moritz Hardt; Celestine Mendler-Dünner (2023). Questioning the survey responses of large language models. 
[b19] Ricardo Dominguez-Olmedo; Vedant Nanda; Rediet Abebe; Stefan Bechtold; Christoph Engel; Jens Frankenreiter; Krishna Gummadi; Moritz Hardt; Michael Livermore (2024). Lawma: The power of specialization for legal tasks. 
[b20] Yihong Dong; Xue Jiang; Huanyu Liu; Zhi Jin; Bin Gu; Mengfei Yang; Ge Li (2024). Generalization or memorization: Data contamination and trustworthy evaluation for large language models. 
[b21] Richard O Duda; Peter E Hart (1973). Pattern Classification and Scene Analysis. Wiley
[b22] EleutherAI (2024-05-20). Language model evaluation harness. 
[b23] Clémentine Fourrier; Nathan Habib; Alina Lozovskaya; Konrad Szafer; Thomas Wolf (2024-07-08). Open llm leaderboard v2. 
[b24] Clémentine Fourrier; Nathan Habib; Alina Lozovskaya; Konrad Szafer; Thomas Wolf (2024-07-08). Performances are plateauing, let's make the leaderboard steep again. 
[b25] Samir Yitzhak Gadre; Georgios Smyrnis; Vaishaal Shankar; Suchin Gururangan; Mitchell Wortsman; Rulin Shao; Jean Mercat; Alex Fang; Jeffrey Li; Sedrick Keh (2024). Language models scale reliably with over-training and on downstream tasks. CoRR
[b26] Ruyi Gan; Ziwei Wu; Renliang Sun; Junyu Lu; Xiaojun Wu; Dixiang Zhang; Kunhao Pan; Ping Yang; Qi Yang; Jiaxing Zhang (2023). Ziya2: Data-centric learning is all llms need. 
[b27] Deep Ganguli; Danny Hernandez; Liane Lovitt; Amanda Askell; Yuntao Bai; Anna Chen; Tom Conerly; Nova Dassarma; Dawn Drain; Nelson Elhage (2022). Predictability and surprise in large generative models. 
[b28] Leo Gao; Stella Biderman; Sid Black; Laurence Golding; Travis Hoppe; Charles Foster; Jason Phang; Horace He; Anish Thite; Noa Nabeshima; Shawn Presser; Connor Leahy (2020). The Pile: An 800gb dataset of diverse text for language modeling. 
[b29] John S Garofolo; Lori F Lamel; William M Fisher; Jonathan G Fiscus; David S Pallett (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc. NASA STI/Recon technical report n
[b31] Gemma Team; Thomas Mesnard; Cassidy Hardin; Robert Dadashi; Surya Bhupatiraju; Shreya Pathak; Laurent Sifre; Morgane Rivière; Mihir Sanjay Kale; Juliette Love (2024). Gemma: Open models based on gemini research and technology. 
[b32] Shahriar Golchin; Mihai Surdeanu (2023). Time travel in LLMs: Tracing data contamination in large language models. 
[b33] Dirk Groeneveld; Iz Beltagy; Pete Walsh; Akshita Bhagia; Rodney Kinney; Oyvind Tafjord; Ananya Harsh Jha; Hamish Ivison; Ian Magnusson; Yizhong Wang; Shane Arora; David Atkinson; Russell Authur; Khyathi Chandu; Arman Cohan; Jennifer Dumas; Yanai Elazar; Yuling Gu; Jack Hessel; Tushar Khot; William Merrill; Jacob Morrison; Niklas Muennighoff; Aakanksha Naik; Crystal Nam; Matthew E Peters; Valentina Pyatkin; Abhilasha Ravichander; Dustin Schwenk; Saurabh Shah; Will Smith; Nishant Subramani; Mitchell Wortsman; Pradeep Dasigi; Nathan Lambert; Kyle Richardson; Jesse Dodge; Kyle Lo; Luca Soldaini; Noah A Smith; Hannaneh Hajishirzi (2024). Olmo: Accelerating the science of language models. 
[b34] Yuling Gu; Oyvind Tafjord; Bailey Kuehl; Dany Haddad; Jesse Dodge; Hannaneh Hajishirzi (2024). Olmes: A standard for language model evaluations. 
[b35] Suriya Gunasekar; Yi Zhang; Jyoti Aneja; Caio César Teodoro Mendes; Allie Del Giorno; Sivakanth Gopi; Mojan Javaheripi; Piero Kauffmann; Gustavo De Rosa; Olli Saarikivi (2023). Textbooks are all you need. 
[b36] Moritz Hardt; Benjamin Recht (2022). Patterns, predictions, and actions: Foundations of machine learning. Princeton University Press
[b37] Trevor Hastie; Robert Tibshirani; Jerome Friedman (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer
[b38] Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt (2020). Measuring massive multitask language understanding. 
[b39] Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt (2021). Measuring mathematical problem solving with the math dataset. 
[b40] Jordan Hoffmann; Sebastian Borgeaud; Arthur Mensch; Elena Buchatskaya; Trevor Cai; Eliza Rutherford; Diego de Las Casas; Lisa Anne Hendricks; Johannes Welbl; Aidan Clark (2022). Training compute-optimal large language models. 
[b41] Or Honovich; Thomas Scialom; Omer Levy; Timo Schick (2022). Unnatural instructions: Tuning language models with (almost) no human labor. 
[b42] InternLM Team (2023). InternLM: A multilingual language model with progressively enhanced capabilities. 
[b43] Minhao Jiang; Ken Liu; Ming Zhong; Rylan Schaeffer; Siru Ouyang; Jiawei Han; Sanmi Koyejo (2024). Does data contamination make a difference? insights from intentionally contaminating pretraining data for language models. 
[b44] Jared Kaplan; Sam Mccandlish; Tom Henighan; Tom B Brown; Benjamin Chess; Rewon Child; Scott Gray; Alec Radford; Jeffrey Wu; Dario Amodei (2020). Scaling laws for neural language models. 
[b45] Sayash Kapoor; Arvind Narayanan (2022). Leakage and the reproducibility crisis in ml-based science. 
[b46] Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. 
[b47] Philipp Koehn (2005). Europarl: A parallel corpus for statistical machine translation. International Association for Machine Translation
[b48] Nathan Lambert; Jacob Morrison; Valentina Pyatkin; Shengyi Huang; Hamish Ivison; Faeze Brahman; Lester James V Miranda; Alisa Liu; Nouha Dziri; Shane Lyu (2024). Tülu 3: Pushing frontiers in open language model post-training. 
[b49] Yann Lecun; Corinna Cortes;  Burges (1998). Mnist handwritten digit database. 
[b50] Changmao Li; Jeffrey Flanigan (2024). Task contamination: Language models may not be few-shot anymore. 
[b51] Jeffrey Li; Alex Fang; Georgios Smyrnis; Maor Ivgi; Matt Jordan; Samir Gadre; Hritik Bansal; Etash Guha; Sedrick Keh; Kushal Arora (2024). Datacomp-lm: In search of the next generation of training sets for language models. 
[b52] Percy Liang; Rishi Bommasani; Tony Lee; Dimitris Tsipras; Dilara Soylu; Michihiro Yasunaga; Yian Zhang; Deepak Narayanan; Yuhuai Wu; Ananya Kumar (2023). Holistic evaluation of language models. Transactions on Machine Learning Research
[b53] Ilya Loshchilov; Frank Hutter (2018). Decoupled weight decay regularization. 
[b54] Sheng Lu; Irina Bigoulaeva; Rachneet Sachdeva; Harish Tayyar Madabushi; Iryna Gurevych (2023). Are emergent abilities in large language models just in-context learning?. 
[b55] Inbal Magar; Roy Schwartz (2022). Data contamination: From memorization to exploitation. 
[b56] MetaAI (2024). Llama 3: Advancing open foundation models. 
[b57] Arindam Mitra; Hamed Khanpour; Corby Rosset; Ahmed Awadallah (2024). Orca-math: Unlocking the potential of slms in grade school math. 
[b58] Subhabrata Mukherjee; Arindam Mitra; Ganesh Jawahar; Sahaj Agarwal; Hamid Palangi; Ahmed Awadallah (2023). Orca: Progressive learning from complex explanation traces of gpt-4. 
[b59]  Openllama;  Openllama (2009). . Cambridge university press
[b60] Judea Pearl (2013). Linear models: A useful "microscope" for causal analysis. Journal of Causal Inference
[b61] Matthew E Peters; Mark Neumann; Mohit Iyyer; Matt Gardner; Christopher Clark; Kenton Lee; Luke Zettlemoyer (2018). Deep contextualized word representations. NAACL
[b62] Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever (2019). Language models are unsupervised multitask learners. OpenAI blog
[b63] Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J Liu (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research
[b64] Samyam Rajbhandari; Jeff Rasley; Olatunji Ruwase; Yuxiong He (2020). Zero: Memory optimizations toward training trillion parameter models. IEEE
[b65] David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael;  Samuel R Bowman (2023). Gpqa: A graduate-level google-proof q&a benchmark. 
[b66] Manley Roberts; Himanshu Thakur; Christine Herlihy; Colin White; Samuel Dooley (2023). Data contamination through the lens of time. 
[b67] Anna Rogers; Sasha Luccioni (2024). Position: Key claims in llm research have a long tail of footnotes. 
[b68] Oscar Sainz; Jon Ander Campos; Iker García-Ferrero; Julen Etxaniz; Oier Lopez De Lacalle; Eneko Agirre (2023). NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. 
[b69] Erik F Tjong Kim Sang; Fien De Meulder (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. 
[b70] Rylan Schaeffer; Brando Miranda; Sanmi Koyejo (2024). Are emergent abilities of large language models a mirage?. Advances in Neural Information Processing Systems
[b71] Rylan Schaeffer; Hailey Schoelkopf; Brando Miranda; Gabriel Mukobi; Varun Madan; Adam Ibrahim; Herbie Bradley; Stella Biderman; Sanmi Koyejo (2024). Why has predicting downstream capabilities of frontier ai models with scale remained elusive?. 
[b72] Charlie Victor Snell; Eric Wallace; Dan Klein; Sergey Levine (2024). Predicting emergent capabilities by finetuning. 
[b73] Zayne Rea Sprague; Xi Ye; Kaj Bostrom; Swarat Chaudhuri; Greg Durrett (2023). Musr: Testing the limits of chain-of-thought with multistep soft reasoning. 
[b74] Aarohi Srivastava; Abhinav Rastogi; Abhishek Rao; Abu Awal Md Shoeb; Abubakar Abid; Adam Fisch; Adam R Brown; Adam Santoro; Aditya Gupta; Adrià Garriga-Alonso (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. 
[b75] StabilityAI (2023). StableLM. 
[b76] Shane Storks; Qiaozi Gao; Joyce Y Chai (2019). Recent advances in natural language inference: A survey of benchmarks, resources, and approaches. 
[b77] Mirac Suzgun; Nathan Scales; Nathanael Schärli; Sebastian Gehrmann; Yi Tay; Hyung Won Chung; Aakanksha Chowdhery; Quoc Le; Ed Chi; Denny Zhou (2023). Challenging big-bench tasks and whether chain-of-thought can solve them. 
[b78] Rohan Taori; Ishaan Gulrajani; Tianyi Zhang; Yann Dubois; Xuechen Li; Carlos Guestrin; Percy Liang; Tatsunori B Hashimoto (2023). Stanford alpaca: An instruction-following llama model. 
[b79] TogetherWeCompute (2023). RedPajama-INCITE. 
[b80] Hugo Touvron; Thibaut Lavril; Gautier Izacard; Xavier Martinet; Marie-Anne Lachaux; Timothée Lacroix; Baptiste Rozière; Naman Goyal; Eric Hambro; Faisal Azhar (2023). Llama: Open and efficient foundation language models. 
[b81] Hugo Touvron; Louis Martin; Kevin Stone; Peter Albert; Amjad Almahairi; Yasmine Babaei; Nikolay Bashlykov; Soumya Batra; Prajjwal Bhargava; Shruti Bhosale (2023). Llama 2: Open foundation and fine-tuned chat models. 
[b82] Lewis Tunstall; Edward Beeching; Nathan Lambert; Nazneen Rajani; Kashif Rasul; Younes Belkada; Shengyi Huang; Leandro Von Werra; Clémentine Fourrier; Nathan Habib (2023). Zephyr: Direct distillation of lm alignment. 
[b83] Ben Wang; Aran Komatsuzaki (2021-05). GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. 
[b84] Yizhong Wang; Swaroop Mishra; Pegah Alipoormolabashi; Yeganeh Kordi; Amirreza Mirzaei; Atharva Naik; Arjun Ashok; Arut Selvan Dhanasekaran; Anjana Arunkumar; David Stap (2022). Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. 
[b85] Yubo Wang; Xueguang Ma; Ge Zhang; Yuansheng Ni; Abhranil Chandra; Shiguang Guo; Weiming Ren; Aaran Arulraj; Xuan He; Ziyan Jiang (2024). Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. 
[b86] Jason Wei; Maarten Bosma; Vincent Zhao; Kelvin Guu; Adams Wei Yu; Brian Lester; Nan Du; Andrew M Dai; Quoc V Le (2022). Finetuned language models are zero-shot learners. 
[b87] Jason Wei; Yi Tay; Rishi Bommasani; Colin Raffel; Barret Zoph; Sebastian Borgeaud; Dani Yogatama; Maarten Bosma; Denny Zhou; Donald Metzler (2022). Emergent abilities of large language models. 
[b88] Tianwen Wei; Liang Zhao; Lichang Zhang; Bo Zhu; Lijie Wang; Haihua Yang; Biye Li; Cheng Cheng; Weiwei Lü; Rui Hu (2023). Skywork: A more open bilingual foundation model. 
[b89] Zhaofeng Wu; Linlu Qiu; Alexis Ross; Ekin Akyürek; Boyuan Chen; Bailin Wang; Najoung Kim; Jacob Andreas; Yoon Kim (2023). Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. 
[b90] Aiyuan Yang; Bin Xiao; Bingning Wang; Borong Zhang; Ce Bian; Chao Yin; Chenxu Lv; Da Pan; Dian Wang; Dong Yan (2023). Baichuan 2: Open large-scale language models. 
[b91] An Yang; Baosong Yang; Binyuan Hui; Bo Zheng; Bowen Yu; Chang Zhou; Chengpeng Li; Chengyuan Li; Dayiheng Liu; Fei Huang (2024). Qwen2 technical report. CoRR
[b92] Shuo Yang; Wei-Lin Chiang; Lianmin Zheng; Joseph E Gonzalez; Ion Stoica (2023). Rethinking benchmark and contamination for language models with rephrased samples. 
[b93] Alex Young; Bei Chen; Chao Li; Chengen Huang; Ge Zhang; Guanwei Zhang; Heng Li; Jiangcheng Zhu; Jianqun Chen; Jing Chang (2024). Yi: Open foundation models by 01.AI. 
[b94] Dingli Yu; Simran Kaur; Arushi Gupta; Jonah Brown-Cohen; Anirudh Goyal; Sanjeev Arora (2023). Skill-mix: A flexible and expandable family of evaluations for ai models. 
[b95] Longhui Yu; Weisen Jiang; Han Shi; Jincheng Yu; Zhengying Liu; Yu Zhang; James Kwok; Zhenguo Li; Adrian Weller; Weiyang Liu (2023). Metamath: Bootstrap your own mathematical questions for large language models. 
[b96] Rowan Zellers; Ari Holtzman; Yonatan Bisk; Ali Farhadi; Yejin Choi (2019). Hellaswag: Can a machine really finish your sentence?. 
[b97] Ge Zhang; Scott Qu; Jiaheng Liu; Chenchen Zhang; Chenghua Lin; Leuang Chou; Danny Yu; Esther Pan; Jie Cheng; Qunshu Liu; Raven Lin; Tuney Yuan; Wei Zheng; Xinrun Pang; Yiming Du; Yinghao Liang; Yizhi Ma; Ziyang Li; Bill Ma; Emmanouil Lin; Huan Benetos; Junting Yang; Kaijing Zhou; Minghao Ma; Morry Liu; Noah Niu; Quehry Wang; Ruibo Que; Sine Liu; Shawn Liu; Soren Guo; Wangchunshu Gao; Xinyue Zhou; Yizhi Zhang; Yubo Zhou; Yuelin Wang; Yuhan Bai; Yuxiang Zhang; Zenith Zhang; Zhenzhu Wang; Zijian Yang; Jiajun Zhao; Wanli Zhang; Wenhao Ouyang; Wenhu Huang;  Chen (2024). Map-neo: Highly capable and transparent bilingual large language model series. 
[b98] Jeffrey Zhou; Tianjian Lu; Swaroop Mishra; Siddhartha Brahma; Sujoy Basu; Yi Luan; Denny Zhou; Le Hou (2023). Instruction-following evaluation for large language models. 
[b99] Kun Zhou; Yutao Zhu; Zhipeng Chen; Wentong Chen; Wayne Xin Zhao; Xu Chen; Yankai Lin; Ji-Rong Wen; Jiawei Han (2023). Don't make your LLM an evaluation benchmark cheater. 
[b101]  (2023). EleutherAI/pythia-12. 
[b102]  (2023). EleutherAI/pythia-160m. 
[b103]  Biderman (2023). pythia-2.8b 2022-10 2.8 0.3 EleutherAI/pythia. 
[b104]  Biderman (2023). pythia-6.9b 2022-10 6.9 0.3 EleutherAI/pythia. 
[b105]  Biderman (2023). EleutherAI/pythia-70m. 
[b109]  (). 8 togethercomputer/RedPajama-INCITE-Base-3B-v1 internlm2-7b 2024-01 internlm2-base-7b internlm/internlm2-7b. 
[b110]  Cai (2024). internlm2-chat-1 8b 2024-01 internlm2-base-7b internlm/internlm2-chat-1 8b. 
[b111]  Cai (2024). internlm2-chat-20b 2024-01 internlm2-base-20b internlm/internlm2-chat-20b. 
[b112]  Cai (2023). b) llama-3-8binstruct 2024-04 llama-3-8b meta-llama/Meta-Llama-3-8B-Instruct MetaAI. 
[b113]  Zhang (2024). map-neo-7b-sft 2024-05 map-neo-7b m-a-p/neo 7b sft v0. 
[b114]  Zhang (2024). olmo-7b-0724-instruct-hf 2024. 
[b115]  (). olmo-7b allenai/OLMo-7B-0724-Instruct-hf. 
[b116]  Groeneveld (2024). olmo-7b-0724-sfthf 2024-01 allenai. 
[b117]  Groeneveld (2024). olmo-7b-instructhf 2024-01 olmo-7b allenai/OLMo-7B-Instructhf. 
[b118]  Groeneveld (2023). redpajama-7b-chat 2023-05 redpajama-7b togethercomputer/RedPajama-INCITE-7B-Chat TogetherWeCompute (2023) redpajama-chat-3b-v1 2023-05 redpajama-3b togethercomputer/RedPajama-INCITE-Chat-3B-v1 TogetherWeCompute (2023) redpajama. -Chat Bai et al
[b119]  Bellagente (2024). stablelm-zephyr-3b 2023-11 stablelm-3b-4e1t stabilityai/stablelm-zephyr-3b. 
[b120]  Tunstall (2023). llama-13b lmsys/vicuna-13b-v1. 
[b121]  Chiang (2023). llama-13b lmsys/vicuna-13b-v1. 
[b122]  Chiang (2023). b-v1.1 2023. 
[b123]  Chiang (2023). b-v1. 
[b124]  Chiang (2020). B.2 FINE-TUNING HYPERPARAMETERS We fine-tune all model parameters. For models with less than 10B parameters. 

Figures:
Figure fig_1: 2
Type: figure
Caption: Figure 2: Models trained before November 2023 tend to benefit more from fine-tuning on task data.
Data: 

Figure fig_3: 3
Type: figure
Caption: Figure 3: Models trained before November 2023 (•) without fine-tuning and (•) after fine-tuning on the test task. Their difference in benchmark performance θ resembles that between newer and older models. After adjusting by training on the test task, their difference vanishes.
Data: 

Figure fig_4: 4
Type: figure
Caption: Figure 4: Reformulating ARC and HellaSwag as MMLU-style questions gives rise to large differences θ between models trained (•) before November 2023 and (•) after November 2023. After adjusting by fine-tuning on the test task, differences in performance vanish.
Data: 

Figure fig_5: 5
Type: figure
Caption: Figure 5: When evaluating MMLU using "cloze" prompts, models trained (•) after November 2023 no longer outperform those trained (•) before November 2023 (middle). When using Brier score as the evaluation metric, we still observe sharp improvements in performance (right).
Data: 

Figure fig_7: 67
Type: figure
Caption: Figure 6: Training on the test task confounds relative comparisons between model families. After adjusting for test task training, none of the three model families appears to be superior.
Data: 

Figure fig_8: 8
Type: figure
Caption: Figure 8: Scaling on MMLU as models increasingly train on the test task. The point of emergence $c_e$ arises at lower scales (top). Training on the test task yields cleaner log-linear scaling fits (bottom).
Data: 

Figure fig_9: 10
Type: figure
Caption: Figure 10: Whether a model was trained after November 2023 (N) influences its pretraining compute (C) and how much it trains on the test task (T). All three influence the benchmark accuracy (A) of the model.
Data: 

Figure fig_10: 
Type: figure
Caption: Legend: models trained before September 2023 vs. after September 2023.
Data: 

Figure fig_11: 11
Type: figure
Caption: Figure 11: Robustness check with September 2023 as the temporal cutoff.
Data: 

Figure fig_12: 12
Type: figure
Caption: Figure 12: Robustness check with January 2024 as the temporal cutoff.
Data: 

Figure fig_13: 
Type: figure
Caption: Legend: models trained primarily on EN data vs. on both EN and CN data.
Data: 

Figure fig_14: 13
Type: figure
Caption: Figure 13: Models trained on both English (EN) and Chinese (CN) language data outperform those trained primarily on English data. After adjusting for test task training, we find no evidence of a significant difference θ in performance between models trained on EN data and EN+CN data.
Data: 

Figure fig_15: 14
Type: figure
Caption: Figure 14: New models resemble old models that were fine-tuned. Temporal cutoff: November 2023.
Data: 

Figure fig_16: 
Type: figure
Caption: Legend: models trained before November 2023 vs. after November 2023. Bold indicates statistical significance with p < 0.05.
Data: 

Figure fig_17: 
Type: figure
Caption: Legend: models trained before November 2023 vs. after November 2023. Bold indicates statistical significance with p < 0.05.
Data: 

Figure fig_18: 17
Type: figure
Caption: Figure 17: Training on the test task significantly alters model rankings on MMLU.
Data: 

Figure fig_20: 1819
Type: figure
Caption: Figure 18: ARC and HellaSwag scores of models trained (•) before November 2023 and (•) after. Middle: reformulating the test task as multiple-choice leads to emergence around $10^{22}$ to $10^{23}$ FLOPs. Right: when using Brier score as the metric, we similarly observe sharp changes in performance around $10^{22}$ to $10^{23}$ FLOPs.
Data: 

Figure fig_22: 20
Type: figure
Caption: Figure 20: Reproducing Figure 1 for instruction-tuned and chat models.
Data: 

Figure tab_0: 
Type: table
Caption: Figure 9: We perform a learning rate sweep using values [$6 \cdot 10^{-5}$, $2 \cdot 10^{-5}$, $6 \cdot 10^{-6}$, $2 \cdot 10^{-6}$, 6 · 10
Data:
Figure 9 (plot): panels "Original hyperparameters" and "Best hyperparameters for newer models"; x-axis: Pre-training compute (FLOPs); y-axis: MMLU (Post-adjustment); annotations: Difference = 0.005, Regression R² = 0.990; Difference = 0.005, Regression R² = 0.992.

Base models
Name | Train date | N | D | HF repository | Citation
baichuan-13b | 2023-06 | 13 | 1.4 | baichuan-inc/Baichuan-13B-Base | Yang et al. (2023a)
baichuan-7b | 2023-06 | 7 | 1.2 | baichuan-inc/Baichuan2-7B-Base | Yang et al. (2023a)
baichuan2-13b | 2023-09 | 13 | 2.6 | baichuan-inc/Baichuan2-13B-Base | Yang et al. (2023a)
baichuan2-7b | 2023-09 | 7 | 2.6 | baichuan-inc/Baichuan2-7B-Base | Yang et al. (2023a)
falcon-11b | 2024-05 | 11 | 5.0 | tiiuae/falcon-11B | Almazrouei et al. (2023)
falcon-7b | 2023-04 | 7 | 1.5 | tiiuae/falcon-7b | Almazrouei et al. (2023)
gemma-2b | 2024-02 | 2 | 3.0 | google/gemma-2b | Gemma et al. (2024)
gemma-7b | 2024-02 | 7 | 6.0 | google/gemma-7b | Gemma et al. (2024)
gpt-j-6b | 2021-03 | 6 | 0.4 | EleutherAI/gpt-j-6b | Wang & Komatsuzaki (2021)
internlm-20b | 2023-09 | 20 | 2.3 | internlm/internlm-20b | InternLM (2023)
internlm-7b | 2023-07 | 7 | 1.0 | internlm/internlm-7b | InternLM (2023)
internlm2-base-20b | 2024-01 | 20 | 2.6 | internlm/internlm2-base-20b | Cai et al. (2024)
internlm2-base-7b | 2024-01 | 7 | 2.6 | internlm/internlm2-base-7b | Cai et al. (2024)
llama-13b | 2023-02 | 13 | 1.0 | None | Touvron et al. (2023a)
llama-2-13b | 2023-07 | 13 | 2.0 | meta-llama/Llama-2-13b-hf | Touvron et al. (2023b)
llama-2-70b | 2023-07 | 70 | 2.0 | meta-llama/Llama-2-70b-hf | Touvron et al. (2023b)
llama-2-7b | 2023-07 | 7 | 2.0 | meta-llama/Llama-2-7b-hf | Touvron et al. (2023b)
llama-3-8b | 2024-04 | 8 | 15.0 | meta-llama/Meta-Llama-3-8B | MetaAI (2024)
llama-30b | 2023-02 | 32.5 | 1.4 | None | Touvron et al. (2023a)
llama-65b | 2023-02 | 65.2 | 1.4 | None | Touvron et al. (2023a)
llama-7b | 2023-02 | 7 | 1.0 | None | Touvron et al. (2023a)
map-neo-7b | 2024-05 | 7 | 4.5 | m-a-p/neo_7b | Zhang et al. (2024)
olmo-1.7-7b | 2024-04 | 7 | 2.0 | allenai/OLMo-1.7-7B-hf | Groeneveld et al. (2024)
olmo-1b | 2024-01 | 1 | 2.0 | allenai/OLMo-1B-hf | Groeneveld et al. (2024)
olmo-7b | 2024-01 | 7 | 2.5 | allenai/OLMo-7B-hf | Groeneveld et al. (2024)
openllama-13b | 2023-06 | 13 | 1.0 | openlm-research/open_llama_13b | OpenLlama (2023)
openllama-3b | 2023-06 | 3 | 1.0 | openlm-research/open_llama_3b | OpenLlama (2023)
openllama-3b-v2 | 2023-07 | 3 | 1.0 | openlm-research/open_llama_3b_v2 | OpenLlama (2023)
openllama-7b | 2023-06 | 7 | 1.0 | openlm-research/open_llama_7b | OpenLlama (2023)
openllama-7b-v2 | 2023-07 | 7 | 1.0 | openlm-research/open_llama_7b_v2 | OpenLlama (2023)
redpajama-7b | 2023-05 | 7 | 1.0 | togethercomputer/RedPajama-INCITE-7B-Base | TogetherWeCompute (2023)
skywork-13b | 2023-10 | 13 | 3.2 | Skywork/Skywork-13B-base | Wei et al. (2023)
stablelm-2-1.6b | 2024-01 | 1.6 | 2.0 | stabilityai/stablelm-2-1_6b | Bellagente et al. (2024)
stablelm-2-12b | 2024-03 | 12.1 | 2.0 | stabilityai/stablelm-2-12b | Bellagente et al. (2024)
stablelm-3b-4e1t | 2023-09 | 2.8 | 4.0 | stabilityai/stablelm-3b-4e1t | StabilityAI (2023)
stablelm-base-alpha-3b-v2 | 2023-08 | 2.8 | 1.1 | stabilityai/stablelm-base-alpha-3b-v2 | StabilityAI (2023)
stablelm-base-alpha-7b-v2 | 2023-08 | 7 | 1.1 | stabilityai/stablelm-base-alpha-7b-v2 | StabilityAI (2023)
yi-6b | 2023-11 | 6 | 3.1 | 01-ai/Yi-1.5-6B | Young et al. (2024)
ziya2-13b-base | 2023-11 | 13 | 2.65 | IDEA-CCNL/Ziya2-13B-Base | Gan et al. (2023)

B.1.1 INSTRUCTION-TUNED AND CHAT MODELS
Name | Train date | Base model | HF repository | Citation
falcon-7b-instruct | 2023-04 | falcon-7b | tiiuae/falcon-7b-instruct | Almazrouei et al. (2023)
gemma-2b-instruct | 2024-02 | gemma-2b | google/gemma-2b-it | Gemma et al. (2024)
gemma-7b-instruct | 2024-02 | gemma-7b | google/gemma-7b-it | Gemma et al. (2024)
internlm-chat-20b | 2023-09 | internlm-20b | internlm/internlm-chat-20b | InternLM (2023)
internlm-chat-7b | 2023-07 | internlm-7b | internlm/internlm-chat-7b | InternLM (2023)

Figure tab_1: 3
Type: table
Caption: The indirect effect N → T → A mediated by test task training T is positive, significant, and large: newer models attain higher benchmark scores primarily because of training on the test task.
Data:
 | MMLU | GSM8K
φ | 0.071 (0.018) | 0.168 (0.032)
R² | 0.530 | 0.503
Standard errors in parentheses. Bold indicates p < 0.05.

Figure tab_2: 4
Type: table
Caption: We find no evidence of a significant direct effect of model recency N on accuracy A, that is, no evidence that the improvements of newer models are attributable to anything other than training on the test task.
Data:
 | MMLU | GSM8K
ψ | -0.004 (0.009) | 0.000 (0.032)
R² | 0.926 | 0.763
Standard errors in parentheses. Bold indicates p < 0.05.

Figure tab_3: 5
Type: table
Caption: Accuracy in discriminating between older and newer models in terms of their pretraining compute and benchmark accuracy. Older, fine-tuned models are indistinguishable from newer models.
Data:
Discriminator test | MMLU | GSM8K
Older models vs newer models | 64.6% | 73.9%
Fine-tuned, older models vs newer models | 52.2% | 52.5%
Random chance accuracy is 50%.


Formulas:
Formula formula_0: $$A = \alpha \max(0, \log C - c_e) + \theta N + r + \epsilon \tag{1}$$
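
As an illustration of how a regression of the form in Equation 1 could be fit, the following minimal sketch grid-searches the emergence point $c_e$ and, for each candidate, solves ordinary least squares for $\alpha$ and $\theta$. The synthetic data, the grid, the treatment of $r$ as a known constant, and the function name `fit_emergence_regression` are assumptions made for this example; it is not presented as the authors' estimation procedure.

```python
import numpy as np

def fit_emergence_regression(log_c, n, acc, r, c_grid):
    """Fit A = alpha * max(0, log C - c_e) + theta * N + r by grid search over c_e.

    r is treated as a known constant offset (an assumption of this sketch).
    Returns (alpha, theta, c_e) for the candidate c_e with the lowest squared error.
    """
    log_c, n, acc = (np.asarray(v, dtype=float) for v in (log_c, n, acc))
    best = None
    for c_e in c_grid:
        # Design matrix: hinge feature max(0, log C - c_e) and the recency indicator N.
        X = np.column_stack([np.maximum(0.0, log_c - c_e), n])
        coef, *_ = np.linalg.lstsq(X, acc - r, rcond=None)
        sse = float(np.sum((X @ coef - (acc - r)) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(c_e), float(coef[0]), float(coef[1]))
    _, c_e, alpha, theta = best
    return alpha, theta, c_e

# Synthetic data (hypothetical values, natural-log compute between ~1e20 and ~1e24 FLOPs).
rng = np.random.default_rng(0)
log_c = rng.uniform(46.0, 55.0, size=400)
n = rng.integers(0, 2, size=400).astype(float)
true_alpha, true_theta, true_c_e, r = 0.05, 0.08, 50.0, 0.25
acc = true_alpha * np.maximum(0.0, log_c - true_c_e) + true_theta * n + r
acc = acc + rng.normal(0.0, 0.01, size=400)

alpha_hat, theta_hat, c_e_hat = fit_emergence_regression(
    log_c, n, acc, r, c_grid=np.linspace(46.0, 54.0, 81)
)
print(alpha_hat, theta_hat, c_e_hat)
```

In this setup `theta_hat` plays the role of the estimated difference θ between newer and older models at equal pretraining compute.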

Formula formula_1: $$f(C, \alpha) = \alpha_0 + \sum_{i=1}^{|\alpha|} \alpha_i \log C \cdot [C > c_i] \tag{2}$$
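
To make the piecewise log-linear form of Equation 2 concrete, here is a small sketch that evaluates it literally as written: an intercept plus one log-slope term per threshold that the compute value exceeds. The function name `piecewise_log_linear`, the use of the natural logarithm, and the example coefficients and thresholds are assumptions for illustration only.

```python
import numpy as np

def piecewise_log_linear(c, alpha, knots):
    """Evaluate f(C, alpha) = alpha_0 + sum_i alpha_i * log(C) * [C > c_i].

    alpha holds alpha_0 followed by one slope per knot; knots holds the c_i.
    """
    c = np.atleast_1d(np.asarray(c, dtype=float))
    alpha = np.asarray(alpha, dtype=float)
    knots = np.asarray(knots, dtype=float)
    if alpha.size != knots.size + 1:
        raise ValueError("alpha must contain alpha_0 plus one coefficient per knot")
    # active[j, i] is True when compute value j exceeds knot i (the indicator [C > c_i]).
    active = c[:, None] > knots[None, :]
    slopes = (active * alpha[1:][None, :]).sum(axis=1)
    return alpha[0] + slopes * np.log(c)

# Hypothetical coefficients and thresholds, chosen only to exercise each regime.
values = piecewise_log_linear(
    [1e20, 1e22, 1e24], alpha=[0.25, 0.01, 0.02], knots=[1e21, 1e23]
)
print(values)
```

Below the first threshold only the intercept contributes; each threshold crossed adds its own slope on $\log C$.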

Formula formula_2: $$T := f(C, \beta) + \phi N + \delta, \quad \delta \sim \mathcal{N}(0, \sigma_\delta^2) \tag{3}$$

Formula formula_3: $$A := f(C, \alpha) + \psi N + \gamma T + \eta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2) \tag{4}$$

Formula formula_4: $$A|_{do(T=t)} = f(C, \alpha) + \psi N + \gamma t + \eta + \epsilon = f(C, \alpha) + \psi N + \eta' + \epsilon, \quad \eta' = \eta + \gamma t \tag{5}$$

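Formula formula_5: $$\begin{aligned} A - A|_{do(T=t)} &= (f(C, \alpha) + \psi N + \gamma T + \eta + \epsilon_1) - (f(C, \alpha) + \psi N + \gamma t + \eta + \epsilon_2) \\ &= \gamma T - \gamma t + \epsilon_1 - \epsilon_2 \\ &= f(C, \gamma\beta) + \gamma\phi N + \gamma\delta - \gamma t + \epsilon_1 - \epsilon_2 \\ &= f(C, \beta') + \phi' N + b + \epsilon', \end{aligned} \tag{6}$$ for $\beta' = \gamma\beta$, $\phi' = \gamma\phi$, $b = -\gamma t$, $\epsilon' = \epsilon_1 - \epsilon_2 + \gamma\delta$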
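
The identity in Equation 6 can be checked numerically. The sketch below simulates the structural equations (3) to (5) and regresses the difference between unadjusted and adjusted accuracy on compute and recency; the coefficient on N should then be close to $\gamma\phi$. All parameter values, the single-segment choice $f(C, \beta) = \beta_0 + \beta_1 \log C$, the dependence of compute on N, and the treatment of $\eta$ as a fixed constant are assumptions of this example rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5000

# Hypothetical structural parameters.
gamma, phi, psi, t = 0.5, 0.2, 0.03, 1.0
beta0, beta1 = -2.0, 0.05      # f(C, beta), single log-linear segment (assumption)
alpha0, alpha1 = -1.0, 0.03    # f(C, alpha), single log-linear segment (assumption)
eta = 0.1                      # eta treated as a constant offset (assumption)

# Recency N influences pretraining compute C, as in the causal graph of Figure 10.
n = rng.integers(0, 2, size=m).astype(float)
log_c = 48.0 + 2.0 * n + rng.normal(0.0, 2.0, size=m)

f_beta = beta0 + beta1 * log_c
f_alpha = alpha0 + alpha1 * log_c
delta = rng.normal(0.0, 0.05, size=m)

T = f_beta + phi * n + delta                                               # Eq. 3
A = f_alpha + psi * n + gamma * T + eta + rng.normal(0.0, 0.01, size=m)    # Eq. 4
A_do = f_alpha + psi * n + gamma * t + eta + rng.normal(0.0, 0.01, size=m)  # Eq. 5

# Regress A - A|do(T=t) on an intercept, log C, and N (right-hand side of Eq. 6).
X = np.column_stack([np.ones(m), log_c, n])
coef, *_ = np.linalg.lstsq(X, A - A_do, rcond=None)
intercept_hat, slope_hat, phi_prime_hat = (float(v) for v in coef)
print(intercept_hat, slope_hat, phi_prime_hat)
```

The direct-effect terms $f(C, \alpha)$, $\psi N$, and $\eta$ cancel in the difference, so the fitted coefficient on N reflects only the path through test task training.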
