Keywords: Watermarking, Dataset Inference, Contamination
TL;DR: We introduce a method to watermark benchmark questions by subtly rephrasing them, which allows us to statistically detect if a model has been illegitimately trained on test data.
Abstract: Benchmark contamination undermines LLM evaluations, and existing post-hoc detection methods are inferential and thus lack verifiable guarantees. We propose a proactive solution: embedding cryptographic watermarks into benchmarks \emph{before} their release by reformulating questions with a language model, and we introduce a detection algorithm that overcomes tokenizer mismatches by aligning text prefixes to reliably identify the watermark signal in the suspect model. To validate our method, we pre-train 1B-parameter models on 10B tokens with controlled contamination of MMLU and ARC. The watermarking process preserves benchmark utility, while our test detects contamination with high confidence, achieving, e.g., a $p$-value $< 10^{-5}$ for a mere 5\% performance gain on 5000 MMLU questions.
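The abstract does not spell out the detection statistic, so the following is only a minimal illustrative sketch of how such a test could work, assuming a Kirchenbauer-style keyed green-list watermark and a one-sided binomial test; the helper names (`is_green`, `contamination_p_value`, `seed_key`, `green_fraction`) are hypothetical and not the paper's API. Keying the green-list assignment on token *text* rather than token ids hints at how a method could cope with tokenizer mismatches between the watermarking model and the suspect model.

```python
# Illustrative sketch only: score a suspect model's predicted tokens against a
# keyed green list and compute a p-value for the null hypothesis "the model was
# not trained on the watermarked benchmark". Not the paper's actual algorithm.
import hashlib
from scipy.stats import binom


def is_green(prev_token_text: str, token_text: str, seed_key: str,
             green_fraction: float = 0.5) -> bool:
    """Pseudo-randomly assign a token to the green list, keyed on the preceding
    token's text so scoring depends on surface text, not on tokenizer ids."""
    digest = hashlib.sha256(
        f"{seed_key}|{prev_token_text}|{token_text}".encode()
    ).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < green_fraction


def contamination_p_value(scored_pairs, seed_key: str,
                          green_fraction: float = 0.5) -> float:
    """scored_pairs: (previous_token_text, model_predicted_token_text) pairs
    collected on the watermarked benchmark. Under the null hypothesis, each
    prediction lands in the green list with probability green_fraction."""
    greens = sum(is_green(p, t, seed_key, green_fraction) for p, t in scored_pairs)
    n = len(scored_pairs)
    # One-sided binomial tail: probability of observing at least this many
    # green predictions by chance.
    return binom.sf(greens - 1, n, green_fraction)
```

Under these assumptions, a contaminated model that has memorized the watermarked phrasings would over-predict green tokens, driving the binomial tail probability toward the small $p$-values reported in the abstract.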
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16853