Keywords: Data Contamination; Reliable LLM Evaluation
Abstract: Large language models (LLMs) have achieved impressive performance across diverse tasks, largely driven by large-scale pretraining data. However, this data abundance has led to a critical issue: test data contamination, where benchmark datasets inadvertently overlap with pretraining corpora. This contamination compromises the reliability of LLM evaluation by making it difficult to distinguish genuine generalization from memorization. To address this challenge, existing training data detectors aim to identify clean (unseen) data within potentially contaminated test sets. While effective to some extent, these methods often misclassify contaminated data as clean due to the black-box nature of LLMs, resulting in residual contamination and unreliable evaluation. This raises a key question: Can we control the proportion of contaminated data mistakenly identified as clean, i.e., the false discovery rate (FDR), below a user-specified threshold, while maximizing the amount of clean data retained for evaluation? To this end, we propose TD4Eval, a principled framework for training data detection that simultaneously ensures strict FDR control and high detection power. Specifically, we design a rejection-count-based adaptive weighting strategy that learns the relative contribution of each detector. Based on these weights, we integrate multiple complementary detectors and apply the Benjamini-Hochberg (BH) procedure to control the FDR. Theoretically, we show that TD4Eval achieves asymptotic optimality in controlling FDR and maintaining high power. Empirical results on real-world datasets demonstrate that TD4Eval achieves an average 30% improvement in FDR over state-of-the-art methods.
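The FDR-control step in the abstract rests on the Benjamini-Hochberg (BH) procedure. For intuition, here is a minimal sketch of the standard (unweighted) BH step-up procedure, assuming each test sample has already been assigned a p-value under the null hypothesis that it is contaminated (seen during pretraining). The function name and NumPy interface are illustrative assumptions; TD4Eval's rejection-count-based detector weighting is not reproduced here.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Standard BH step-up procedure.

    Returns a boolean mask over the m hypotheses (here: samples
    declared clean/unseen) with FDR controlled at level alpha.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # sort p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds             # step-up comparison p_(k) <= k*alpha/m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest k whose sorted p-value passes
        reject[order[: k + 1]] = True          # reject all hypotheses up to k
    return reject

# Illustrative usage: p-values from a (hypothetical) combined detector score.
clean_mask = benjamini_hochberg([0.001, 0.02, 0.3, 0.8], alpha=0.1)
```

In the paper's setting, the p-values fed into this step would come from the weighted combination of multiple complementary detectors rather than a single score.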
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16521