Keywords: Watermarking, Dataset Inference, Contamination
TL;DR: We introduce a method to watermark benchmark questions by subtly rephrasing them, which allows us to statistically detect if a model has been illegitimately trained on test data.
Abstract: Benchmark contamination undermines LLM evaluations, and existing post-hoc detection methods are inferential and thus lack verifiable guarantees. We propose a proactive solution: embedding cryptographic watermarks into benchmarks \emph{before} their release by reformulating questions with a language model, and we introduce a detection algorithm that overcomes tokenizer mismatches by aligning text prefixes to reliably identify the watermark signal in the suspect model. To validate our method, we pre-train 1B-parameter models on 10B tokens with controlled contamination of MMLU and ARC. The watermarking process preserves benchmark utility, while our test detects contamination with high confidence, achieving, e.g., a $p$-value $< 10^{-5}$ for a mere 5\% performance gain on 5000 MMLU questions.
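The abstract does not spell out the detection statistic, so the following is only a minimal illustrative sketch of how such a test could work, assuming a Kirchenbauer-style keyed green-list watermark and a one-sided binomial test; the helper names (`is_green`, `contamination_p_value`, `seed_key`, `green_fraction`) are hypothetical and not the paper's API. Keying the green-list assignment on token *text* rather than token ids hints at how a method could cope with tokenizer mismatches between the watermarking model and the suspect model.

```python
# Illustrative sketch only: score a suspect model's predicted tokens against a
# keyed green list and compute a p-value for the null hypothesis "the model was
# not trained on the watermarked benchmark". Not the paper's actual algorithm.
import hashlib
from scipy.stats import binom


def is_green(prev_token_text: str, token_text: str, seed_key: str,
             green_fraction: float = 0.5) -> bool:
    """Pseudo-randomly assign a token to the green list, keyed on the preceding
    token's text so scoring depends on surface text, not on tokenizer ids."""
    digest = hashlib.sha256(
        f"{seed_key}|{prev_token_text}|{token_text}".encode()
    ).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < green_fraction


def contamination_p_value(scored_pairs, seed_key: str,
                          green_fraction: float = 0.5) -> float:
    """scored_pairs: (previous_token_text, model_predicted_token_text) pairs
    collected on the watermarked benchmark. Under the null hypothesis, each
    prediction lands in the green list with probability green_fraction."""
    greens = sum(is_green(p, t, seed_key, green_fraction) for p, t in scored_pairs)
    n = len(scored_pairs)
    # One-sided binomial tail: probability of observing at least this many
    # green predictions by chance.
    return binom.sf(greens - 1, n, green_fraction)
```

Under these assumptions, a contaminated model that has memorized the watermarked phrasings would over-predict green tokens, driving the binomial tail probability toward the small $p$-values reported in the abstract.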
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16853