A Dye Pack Framework for Detecting Test Set Contamination in LLMs

ACL ARR 2025 February Submission 6083 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination, where models inadvertently or intentionally train on test data, leading to inflated performance and unfair evaluations. In this work, we introduce a novel dye pack framework that leverages backdoor attacks to identify models that used benchmark test sets during training. Just as banks mix dye packs with cash to mark robbers, our dye pack framework mixes backdoor samples into the test data to flag models that have been trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact computation of the false positive rate when flagging any model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. As a proof of concept, we evaluate our dye pack framework on two benchmarks. Using eight backdoors, our framework successfully catches every contaminated model in our evaluation with guaranteed false positive rates of only 0.000073% on a subset of MMLU-Pro and 0.00085% on a subset of Big-Bench-Hard, highlighting its potential as powerful protection for open benchmarks.
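To make the exact false positive rate computation concrete, below is a minimal sketch under simplified assumptions that are not drawn from the paper itself: each of the B backdoors has a target answer chosen uniformly at random from C candidates, and a model is flagged when it matches at least k backdoor targets. Under these assumptions, an uncontaminated model matches each target independently with probability 1/C, so the false positive rate is a binomial tail. The function and parameter names (`false_positive_rate`, `num_backdoors`, `num_targets`, `min_matches`) and the example values are illustrative, not the authors' exact construction or parameters.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, min_matches: int) -> float:
    """Exact probability that an uncontaminated model is flagged.

    Simplified setting: each backdoor's target is drawn uniformly from
    `num_targets` candidates, so a model never trained on the backdoored
    test data matches any single target with probability 1/num_targets,
    independently across backdoors. The model is flagged if it matches
    at least `min_matches` of the `num_backdoors` targets.
    """
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, m) * (p ** m) * ((1 - p) ** (num_backdoors - m))
        for m in range(min_matches, num_backdoors + 1)
    )


# Illustrative example: 8 backdoors, 10 candidate targets each, flag on 6+ matches.
fpr = false_positive_rate(num_backdoors=8, num_targets=10, min_matches=6)
print(f"Guaranteed false positive rate: {fpr:.6%}")
```

Because the stochastic targets are fixed before release and are independent of any honest model's behavior, this tail probability is an exact, provable bound on how often an uncontaminated model could be flagged by chance, which is what allows the framework to guarantee the small false positive rates reported in the abstract.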
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies
Contribution Types: Theory
Languages Studied: English
Submission Number: 6083