Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation
Abstract: Data contamination has become an increasingly prominent concern in the evaluation of large language models (LLMs). To mitigate this risk, LLM benchmarking has evolved from a \textit{static} to a \textit{dynamic} paradigm.
In this work, we conduct an in-depth analysis of existing \textit{static} and \textit{dynamic} benchmarks for evaluating LLMs.
We first examine methods that enhance \textit{static} benchmarks and identify their inherent limitations. We then highlight a critical gap: the lack of standardized criteria for evaluating \textit{dynamic} benchmarks. Based on this observation, we propose a set of design principles for \textit{dynamic} benchmarking and analyze the limitations of existing \textit{dynamic} benchmarks.
This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts.
We maintain a GitHub repository that continuously collects both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, benchmarking, evaluation, automatic evaluation of datasets
Contribution Types: Surveys
Languages Studied: English
Submission Number: 5680