Beyond Clean and Contaminated: A Survey on the Fundamental Properties of Data Contamination in Large Language Models
Keywords: Evaluation Methodologies; Data Contamination; Large Language Model
Abstract: Benchmark-based evaluation remains the primary mechanism for comparing large language models (LLMs), yet modern development pipelines increasingly blur the boundary between training and testing.
Beyond direct train-test overlap, contamination leaks through pathways such as post-training, evaluation-time "test-set fitting," and retrieval-enabled tool use.
In this paper, we frame data contamination as an evaluation-validity failure mode and propose a three-dimensional taxonomy based on phase, granularity, and modality.
We argue that contamination is regime-dependent rather than binary, summarizing key properties such as inevitability under web-scale collection, scaling effects, and forgettability.
Building on these insights, we reorganize detection methods into two complementary paradigms: statistical approaches (quantifying inflation via observational signals) and causal approaches (verifying via controlled injection). Finally, we provide a critical discussion of these detection methodologies.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies
Contribution Types: Surveys
Languages Studied: English
Submission Number: 2390