Keywords: data contamination, search agents, evaluations
TL;DR: We identify search-time contamination (STC), which occurs when a search agent retrieves a public evaluation benchmark, enabling the agent to copy the answer and thereby compromising test validity.
Abstract: Data contamination refers to the leakage of evaluation data into model training data, breaking test validity. We identify an analogous issue, search-time contamination (STC), which occurs when the retrieval step of a search agent surfaces a source containing the test question (or a near-duplicate) alongside its answer, enabling the agent to copy the answer. On three commonly used capability benchmarks, Humanity's Last Exam (HLE), SimpleQA, and GPQA, we demonstrate that for approximately 3\% of questions, search-based agents directly find the datasets with ground-truth labels on HuggingFace, with an accuracy difference of up to 20\% on HLE. When HuggingFace is blocked, accuracy on the contaminated subset drops. We further show through search ablations that publicly accessible evaluation datasets on HuggingFace may not be the sole source of STC. To facilitate the auditing of evaluation results, we will publicly release the complete logs from our experiments.
Submission Number: 18
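As a rough illustration of the kind of audit the released logs would enable, the sketch below scans per-question agent search logs for retrievals of benchmark datasets hosted on HuggingFace, the STC signal the abstract describes. It assumes a hypothetical JSONL log format with question_id and retrieved_urls fields; the actual schema of the released logs may differ.

# Minimal sketch of a search-time contamination (STC) audit.
# Assumption: logs are JSONL, one record per question, with hypothetical
# fields "question_id" and "retrieved_urls"; adapt to the real schema.
import json
from urllib.parse import urlparse

# Benchmarks whose public HuggingFace dataset pages would leak ground-truth labels.
BENCHMARK_HINTS = ("hle", "simpleqa", "simple_qa", "gpqa")

def is_contaminating_url(url: str) -> bool:
    """Flag URLs pointing at a HuggingFace dataset page for one of the benchmarks."""
    parsed = urlparse(url)
    if parsed.netloc not in ("huggingface.co", "www.huggingface.co"):
        return False
    path = parsed.path.lower()
    return path.startswith("/datasets/") and any(h in path for h in BENCHMARK_HINTS)

def audit_log(log_path: str) -> list[str]:
    """Return IDs of questions whose search step retrieved a benchmark dataset."""
    contaminated = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if any(is_contaminating_url(u) for u in record.get("retrieved_urls", [])):
                contaminated.append(record["question_id"])
    return contaminated

if __name__ == "__main__":
    ids = audit_log("agent_search_logs.jsonl")  # hypothetical log path
    print(f"{len(ids)} potentially contaminated questions")

Note that this URL check only catches direct HuggingFace retrievals; as the abstract's search ablations suggest, mirrors and other public copies of the datasets could contaminate results without matching this filter.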