Keywords: Large Language Models, Data Contamination, Memorization, Benchmark Evaluation, Dose-Response Relationship, Scaling Laws, Model Capacity, Pretraining Corpora, Overtraining, Test Set Leakage
TL;DR: Language models memorize benchmark test data according to their capacity and the training incentive to do so; we use a dose-response model to quantify how the proportion of contaminated data in pretraining affects benchmark accuracy.
Abstract: Accurately evaluating the capabilities of large language models is critical for machine learning research and society alike, but is undermined by leakage of benchmark test data into pretraining corpora.
Circumstantial and causal evidence alike demonstrates that benchmark performance increases with model size and with the number of benchmark replicas in the pretraining corpus.
However, recent work by Bordt et al. (2025) demonstrated that test set contamination has little-to-no impact in the "overtrained" regime common to frontier AI systems, raising an apparent paradox of how test set leakage can be both potent _and_ negligible.
We resolve this paradox with a simple explanation: a language model memorizes a benchmark test set based on its capacity (number of parameters) and its incentive (the relative training loss reduction from memorizing test data).
We introduce a novel dose-response framework that quantifies how the "response" (benchmark performance) depends on the "dose" (the proportion of benchmark tokens contaminating the pretraining data), mediated by model size.
This allows us to extract precise scaling relationships that clarify the effect of test set contamination on model performance.
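As a rough sketch of the kind of relationship such a framework can estimate (the functional form below is an illustrative assumption, not necessarily the parameterization used in the paper), a sigmoidal dose-response curve mediated by model size $N$ could be written as

$$\mathrm{acc}(d, N) \;=\; \mathrm{acc}_0(N) \;+\; \bigl(\mathrm{acc}_{\max} - \mathrm{acc}_0(N)\bigr)\,\frac{d^{\,k}}{d^{\,k} + d_{50}(N)^{\,k}},$$

where $d$ is the dose (the proportion of benchmark tokens in the pretraining corpus), $\mathrm{acc}_0(N)$ is the uncontaminated baseline accuracy, $d_{50}(N)$ is the half-saturation dose (expected to shrink as capacity $N$ grows), and $k$ controls the steepness of the response.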
Submission Number: 47