Abstract: LLMs often fail to handle temporal knowledge conflicts, that is, contradictions that arise when facts evolve over time within their training data.
Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on
widely covered, easily memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open- and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.