Time Waits for No Benchmark: Exploring the Temporal Misalignment between Static Benchmarks, Modern LLMs, and the Real World
Keywords: LLM evaluation, Benchmark
Abstract: The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about the reliability of model assessments.
Many studies have evaluated modern LLMs against popular but outdated benchmarks.
In this work, we explore the temporal misalignment between static benchmarks, modern LLMs, and the real world.
We develop a multi-stage pipeline to conduct temporal comparisons across 10 LLMs and 5 benchmarks.
Three quantitative scores are introduced to measure temporal misalignment.
Experimental analysis shows significant temporal misalignment between widely used benchmarks, current LLMs, and real-world facts.
This misalignment undermines the trustworthiness of LLM evaluations in many existing and future studies.
These findings underscore a fundamental temporal issue in using static benchmarks to evaluate modern LLMs, motivating further exploration of temporally aware strategies for more reliable and robust LLM evaluation.
Submission Number: 85