Detecting Benchmark Contamination Through Watermarking

Published: 06 Mar 2025, Last Modified: 16 Apr 2025, WMARK@ICLR2025, CC BY 4.0
Track: long paper (up to 9 pages)
Keywords: LLM, Watermarking, Benchmark, Contamination
TL;DR: Watermarking benchmarks appears to be a promising solution to the problem of contamination in LLMs: we can maintain benchmark utility while successfully identifying contamination, e.g. p-val = 10^{-3} for +5% on ARC-Easy
Abstract: Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark's quality and utility. During evaluation, we can detect ``radioactivity'', i.e. traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-rephrasing and successful contamination detection when models are contaminated enough to enhance performance, e.g. p-val = $10^{-3}$ for +5% on ARC-Easy.
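
A minimal sketch of how the ``radioactivity'' test described in the abstract could be instantiated, assuming a generic green-list text watermark: each token is pseudo-randomly assigned to a green list seeded by its predecessor, and under the null hypothesis of no contamination the green count among n scored tokens follows a Binomial(n, gamma) distribution, so a one-sided binomial test yields a p-value. The hashing scheme, the gamma value, and all function names below are illustrative assumptions, not the paper's exact detector.

import hashlib
from scipy.stats import binomtest

GAMMA = 0.5  # assumed fraction of the vocabulary placed in the green list

def is_green(prev_token: int, token: int, vocab_size: int) -> bool:
    # Pseudo-randomly mark `token` as green, seeded by the previous token.
    h = hashlib.sha256(f"{prev_token}-{token}".encode()).digest()
    return int.from_bytes(h[:4], "big") % vocab_size < GAMMA * vocab_size

def contamination_pvalue(token_ids: list[int], vocab_size: int) -> float:
    # Under the null hypothesis (no training on the watermarked benchmark),
    # each scored token is green with probability GAMMA, so the green count
    # is Binomial(n, GAMMA). A small p-value indicates radioactivity.
    greens = sum(
        is_green(prev, tok, vocab_size)
        for prev, tok in zip(token_ids[:-1], token_ids[1:])
    )
    n = len(token_ids) - 1
    return binomtest(greens, n, GAMMA, alternative="greater").pvalue

In this sketch, contamination_pvalue would be applied to tokens produced by the suspect model when prompted with the watermarked benchmark; a small p-value (e.g. on the order of $10^{-3}$, as quoted for ARC-Easy) would be evidence of contamination.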
Presenter: ~Tom_Sander1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 28