DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Published: 24 Sept 2025 | Last Modified: 24 Sept 2025 | NeurIPS 2025 LLM Evaluation Workshop Poster | CC BY 4.0
Keywords: dataset contamination, fair LLM evaluation, backdoor attacks
TL;DR: We introduce DyePack, a framework that leverages backdoor attacks to detect test set contamination in LLMs, without requiring the model's loss, logits, or internals. Crucially, we offer a bounded and exactly computable false positive rate guarantee.
Abstract: Open benchmarks are vital for evaluating large language models, but their accessibility makes them prone to test set contamination. We introduce DyePack, a framework that uses backdoor attacks to detect models trained on benchmark test sets, without requiring access to model loss or logits, yet providing provable false positive guarantees. Like banks mixing dye packs with money to mark robbers, DyePack inserts backdoor samples into test data to flag contaminated models. Our design combines multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation—provably preventing false accusations while ensuring strong evidence of contamination. We evaluate DyePack on five models across three datasets, covering multiple-choice and open-ended tasks. For multiple-choice, it detects all contaminated models with FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended tasks, it generalizes well, detecting all contaminated models on Alpaca with a guaranteed FPR of just 0.127% using six backdoors.
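The abstract states that combining multiple backdoors with stochastic targets makes the false positive rate exactly computable. As a minimal sketch of how such a guarantee could be computed, assume each of k backdoors has a target drawn uniformly from m possible values and a model is flagged when it matches at least t targets; the paper's actual flagging criterion and target distribution may differ, so the function below is illustrative only.

```python
from math import comb

def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """Exact probability that an uncontaminated model matches >= threshold
    backdoor targets purely by chance, assuming each stochastic target is
    drawn uniformly from num_targets options and matches are independent.
    This is a binomial tail probability (illustrative, not the paper's formula)."""
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )

# Example: 8 backdoors, targets uniform over 10 choices, flag only if all 8 match.
print(false_positive_rate(8, 10, 8))  # 1e-08, i.e. 0.000001%
```

Under these assumptions, the guarantee follows because an uncontaminated model has no information about the randomly chosen targets, so its chance of matching them is fixed by the target distribution alone, independent of the model.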
Submission Number: 170