TL;DR: We show that moderate amounts of data contamination are forgotten by the end of LLM training runs.
Abstract: The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: the number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even contamination repeated 144 times can be forgotten if the training data is scaled to more than five times the Chinchilla-optimal amount, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay alone would imply. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Llama 3 405B, have forgotten the data seen at the beginning of training.
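To make the two quantities referenced in the abstract concrete, the sketch below computes a training run's "Chinchilla multiple" using the common rule of thumb of roughly 20 training tokens per model parameter, and the cumulative weight-decay factor under AdamW-style decoupled decay. Both formulas are standard approximations used here only for illustration; they are not taken from the paper's code, and the optimizer settings in the example are hypothetical.

```python
# Minimal sketch (not the paper's code): relate a run's token budget to the
# Chinchilla-optimal budget, and compute the cumulative weight-decay factor
# that the abstract compares against empirical forgetting.

def chinchilla_multiple(n_params: float, n_tokens: float,
                        tokens_per_param: float = 20.0) -> float:
    """How many times the Chinchilla-optimal token budget a run uses,
    assuming the ~20-tokens-per-parameter rule of thumb."""
    return n_tokens / (tokens_per_param * n_params)

def cumulative_weight_decay(lr: float, weight_decay: float, n_steps: int) -> float:
    """Fraction of a weight surviving n_steps of decoupled weight decay,
    ignoring gradient updates: each step multiplies weights by (1 - lr * weight_decay)."""
    return (1.0 - lr * weight_decay) ** n_steps

if __name__ == "__main__":
    # Example: a 1.6B-parameter model trained on 40B tokens
    # (the largest setting mentioned in the abstract).
    print(f"Chinchilla multiple: {chinchilla_multiple(1.6e9, 40e9):.2f}x")
    # Hypothetical optimizer settings, chosen only for illustration.
    print(f"Cumulative decay factor: "
          f"{cumulative_weight_decay(lr=3e-4, weight_decay=0.1, n_steps=20_000):.3f}")
```

In this reading, the cumulative decay factor gives the fraction of a memorized update that weight decay alone would leave intact; the paper's observation that empirical forgetting is faster makes this factor a conservative reference point for gauging forgetting in large-scale runs.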
Lay Summary: When training large language models (LLMs), researchers worry about "data contamination": what happens when a test question accidentally appears in the training data? If a model has seen an answer during training, will it be able to "cheat"?
In this paper, we conduct controlled experiments where we deliberately provide LLMs with answers to evaluation questions. We study how a model's ability to answer depends on how often questions were seen during training, the number of model parameters, and the overall size of the training data.
Our main finding is that the effect of contamination depends strongly on the training setup. On the one hand, if a model is very large or has been exposed to a test question frequently, this can lead to overfitting. On the other hand, exposure to sufficient new data mitigates the overfitting up to the point where the LLM has forgotten that it ever saw the test question.
Our research demonstrates that large training datasets provide natural protection against accidental contamination, which has important implications for how we evaluate and train AI systems.
Link To Code: https://github.com/tml-tuebingen/forgetting-contamination/
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Contamination, Forgetting, Scaling, Optimization
Submission Number: 9771