Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation
Abstract: Data contamination has become an increasingly prominent concern in the evaluation of large language models (LLMs). To mitigate this risk, LLM benchmarking has evolved from a \textit{static} to a \textit{dynamic} paradigm.
In this work, we conduct an in-depth analysis of existing \textit{static} and \textit{dynamic} benchmarks for evaluating LLMs.
We first examine methods that enhance \textit{static} benchmarks and identify their inherent limitations. We then highlight a critical gap: the lack of standardized criteria for evaluating \textit{dynamic} benchmarks. Based on this observation, we propose a set of design principles for \textit{dynamic} benchmarking and analyze the limitations of existing \textit{dynamic} benchmarks.
This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts.
We maintain a GitHub repository that continuously collects both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, benchmarking, evaluation, automatic evaluation of datasets
Contribution Types: Surveys
Languages Studied: English
Submission Number: 5680