Keywords: test set contamination, loss trajectory analysis, learning dynamics
Abstract: Test set contamination poses a serious threat to reliable model evaluation. Whether inadvertent or deliberate, contamination can misrepresent model capabilities to both researchers and the public; this artificially inflated performance may in turn cause harm when such models are deployed in real-world applications. In this work, we propose a novel test set contamination detection method that relies solely on analyzing loss trajectories during deliberate fine-tuning on target benchmarks. Our key insight is that models exhibit quantifiably different learning dynamics when exposed to previously encountered data than to novel data. Concretely, we simulate contamination scenarios by systematically fine-tuning models on test data mixed into decontaminated data at varying proportions, and simulate the clean counterparts by fine-tuning on decontaminated data alone. We show that clustering methods using as few as 200 data points can distinguish clean from contaminated scenarios with over 95\% accuracy. Our method is also more robust at detecting contamination from paraphrased evaluation data than membership inference attack baselines, which operate at the individual sample level and typically target verbatim matches. Critically, our approach represents a paradigm shift from static detection metrics to dynamic, training-based assessment: observing how models react to controlled fine-tuning on target data rather than analyzing fixed outputs or input manipulations. We posit that this intervention-based methodology offers inherently higher resistance to detection evasion, since the metrics cannot be directly optimized as reward signals during model development, providing a more robust foundation for maintaining evaluation integrity.
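The detection step described in the abstract, clustering loss trajectories collected from controlled fine-tuning runs, can be illustrated with a minimal sketch. The snippet below is not the paper's pipeline: the trajectories are synthetic stand-ins, and the decay and plateau parameters are assumptions meant only to mimic the claim that previously seen data yields faster loss drops than novel data.

```python
# Minimal sketch of the trajectory-clustering idea on synthetic data.
# Assumption: contaminated (previously seen) data produces loss curves that
# decay faster and plateau lower than clean (novel) data during fine-tuning.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_runs_per_class, n_steps = 100, 20  # hypothetical: 200 trajectories, 20 recorded steps each

def synthetic_trajectory(seen: bool) -> np.ndarray:
    """Per-step fine-tuning loss for one run (synthetic stand-in)."""
    decay = 0.45 if seen else 0.15   # assumed faster decay for memorized data
    floor = 0.3 if seen else 1.0     # assumed lower loss plateau for memorized data
    steps = np.arange(n_steps)
    return floor + (3.0 - floor) * np.exp(-decay * steps) + rng.normal(0, 0.05, n_steps)

# Each feature vector is the raw loss trajectory of one fine-tuning run.
X = np.stack(
    [synthetic_trajectory(seen=True) for _ in range(n_runs_per_class)]
    + [synthetic_trajectory(seen=False) for _ in range(n_runs_per_class)]
)
y_true = np.array([1] * n_runs_per_class + [0] * n_runs_per_class)

# Unsupervised separation of contaminated vs. clean runs.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
acc = max((pred == y_true).mean(), (pred != y_true).mean())  # cluster labels are arbitrary
print(f"clustering accuracy on synthetic trajectories: {acc:.2%}")
```

In the actual method, the trajectories would come from fine-tuning a real model on benchmark data mixed into decontaminated data; the sketch only shows how a simple clustering step can separate the two regimes once such trajectories are available.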
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22451