Beyond Clean and Contaminated: A Survey on the Fundamental Properties of Data Contamination in Large Language Models
Keywords: Evaluation Methodologies; Data Contamination; Large Language Model
Abstract: Benchmark-based evaluation remains the primary mechanism for comparing large language models (LLMs), yet modern development pipelines increasingly blur the boundary between training and testing.
Beyond direct train-test overlap, contamination leaks through pathways such as post-training, evaluation-time "test-set fitting," and retrieval-enabled tool use.
In this paper, we frame data contamination as an evaluation-validity failure mode and propose a three-dimensional taxonomy based on phase, granularity, and modality.
We argue that contamination is regime-dependent rather than binary, summarizing key properties such as inevitability under web-scale collection, scaling effects, and forgettability.
Building on these insights, we reorganize detection methods into two complementary paradigms: statistical approaches (quantifying inflation via observational signals) and causal approaches (verifying via controlled injection). Finally, we provide a critical discussion of these detection methodologies.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies
Contribution Types: Surveys
Languages Studied: English
Submission Number: 2390