Static Benchmarks Are Broken: The Case for Dynamic Evaluation of LLMs

Published: 25 May 2026, Last Modified: 27 May 2026, DEMO 2026 Poster, CC BY 4.0
Keywords: LLMs, benchmarks, evaluations, reliability
TL;DR: We argue that static benchmarks are flawed because they measure memorization rather than knowledge; dynamic evaluations, synthetically generated at test time, are the remedy.
Abstract: Static, deterministic benchmarks have become the primary tool for measuring large language model (LLM) progress, yet growing evidence suggests they measure memorization rather than genuine capability. Performance on canonical benchmarks such as MMLU and GSM8K degrades sharply under semantics-preserving perturbations, including answer reordering, surface rephrasing, and distractor addition, which reveals brittle pattern matching rather than robust understanding. We argue that this fragility is not an implementation flaw but a structural consequence of using fixed evaluation sets in the era of web-scale training. We advocate for dynamic, synthetically generated benchmarks constructed fresh at evaluation time, which make contamination impossible by construction and enable principled, reproducible evaluation of genuine model capability.
Submission Number: 15
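
Illustrative sketch (not the authors' implementation): the abstract names answer reordering as a semantics-preserving perturbation and proposes generating items fresh at evaluation time. The Python below shows one way both ideas could look for a toy addition task. Every name here (`MultipleChoiceItem`, `reorder_options`, `generate_arithmetic_item`) is hypothetical, and the use of a seeded random generator to make a fresh item reproducible is an assumption about how "reproducible" might be achieved.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: Tuple[str, ...]
    answer_index: int  # position of the correct option in `options`


def reorder_options(item: MultipleChoiceItem, rng: random.Random) -> MultipleChoiceItem:
    """Shuffle the options and track where the correct answer ends up.

    The question and the set of options are unchanged, so the item's
    meaning is preserved; only the surface ordering differs.
    """
    order = list(range(len(item.options)))
    rng.shuffle(order)
    return MultipleChoiceItem(
        question=item.question,
        options=tuple(item.options[i] for i in order),
        answer_index=order.index(item.answer_index),
    )


def generate_arithmetic_item(rng: random.Random) -> MultipleChoiceItem:
    """Build a fresh addition question with three distinct wrong options.

    Assumption: a toy task stands in for whatever generator a real
    dynamic benchmark would use.
    """
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    correct = a + b
    distractors = set()
    while len(distractors) < 3:
        # Nonzero offsets guarantee a distractor never equals the answer.
        distractors.add(correct + rng.choice([-100, -10, -1, 1, 10, 100]))
    options = (str(correct),) + tuple(str(d) for d in sorted(distractors))
    item = MultipleChoiceItem(f"What is {a} + {b}?", options, answer_index=0)
    return reorder_options(item, rng)


if __name__ == "__main__":
    # A new seed yields a new item; reusing a seed reproduces the same item.
    item = generate_arithmetic_item(random.Random(2026))
    print(item.question, item.options, item.options[item.answer_index])
```

Passing an explicit `random.Random` instance, rather than relying on the module-level generator, keeps each generated item a pure function of its seed, so an evaluation run can be replayed exactly while still drawing previously unseen items whenever a new seed is used.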