Causally Quantifying the Effect of Test Set Contamination on Generative Benchmarks

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: test set contamination, data contamination, large language models, LLMs, generative benchmarks, model evaluation, causal analysis, memorization, sampling temperature, mathematical reasoning, trustworthy AI
TL;DR: We intentionally contaminate pretraining data with generative benchmark test sets to measure how contamination's effect on generative evaluations compares to its known effect on discriminative evaluations.
Abstract: As large language models (LLMs) are pretrained on ever-expanding web-scale data, test set contamination has become a critical concern for accurately assessing the capabilities of LLMs. While significant research has quantified the amount and the impact of test set contamination on discriminative (i.e., scoring-based) benchmarks like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative (i.e., sampling-based) evaluations such as coding or mathematical problem solving. As the field shifts more towards generative evaluations, understanding what effect (if any) test set contamination has on generative evaluations becomes all the more important. To causally quantify the effect that test set contamination has on assessed capabilities, we pretrained language models, sweeping the number of replicas of benchmark test data in the pretraining corpora. We make four discoveries: (1) performance increases with contamination and model size, consistent with discriminative evaluations, (2) higher sampling temperature mitigates the effects of contamination, (3) longer solutions require more contamination to reach the same level of performance, and (4) generative performance is tightly coupled with test set memorization, but modulated by sampling temperature. As the field shifts to generative benchmarks to assess reasoning, our work reveals that factors like sampling temperature and solution length introduce novel complexities to data contamination, demanding a more sophisticated approach to model evaluation.
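Finding (2) above, that higher sampling temperature mitigates the effects of contamination, follows from how temperature reshapes the sampling distribution: scaling logits down before the softmax flattens the distribution, making a model less likely to greedily reproduce a memorized continuation. The sketch below is purely illustrative and not taken from the paper; the logit values are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to a sampling distribution at the given temperature.

    Higher temperature flattens the distribution, reducing the probability
    mass on the single highest-logit (e.g., memorized) continuation.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits where index 0 stands in for a memorized token.
logits = [5.0, 1.0, 0.5]
p_low = softmax_with_temperature(logits, 0.2)   # near-greedy sampling
p_high = softmax_with_temperature(logits, 2.0)  # flatter distribution

print(p_low[0])   # close to 1: memorized token nearly always sampled
print(p_high[0])  # noticeably below 1: memorization diluted
```

At low temperature the memorized token dominates the distribution; at high temperature alternative tokens regain probability mass, which is one mechanism by which temperature could decouple measured performance from verbatim memorization.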
Submission Number: 177