Abstract: Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and a dataset of over 8,000 events spanning 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetically trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work takes a significant step toward time-aware language models capable of adapting to the dynamic nature of real-world knowledge.
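To make the probing idea concrete, the following is a minimal sketch of time-shifted evaluation in the spirit the abstract describes: the same question is re-anchored at different reference dates, and an answer only counts as correct if it matches the fact that was valid on that date. The record layout, the `timeshift_prompt` format, and the `query_model` callable are illustrative assumptions, not the paper's actual implementation.

```python
from datetime import date
from typing import Callable

# Hypothetical record shape: (question, answer, valid_from, valid_to).
# The paper's dataset is day-level; these field names are assumptions.
FACTS = [
    ("Who is the US President?", "Donald Trump", date(2018, 1, 1), date(2021, 1, 19)),
    ("Who is the US President?", "Joe Biden", date(2021, 1, 20), date(2024, 12, 31)),
]

def timeshift_prompt(question: str, asof: date) -> str:
    """Anchor the question at an explicit reference date."""
    return f"As of {asof.isoformat()}, {question[0].lower()}{question[1:]}"

def evaluate(query_model: Callable[[str], str], probe_dates: list[date]) -> float:
    """Score a model by shifting the same question across reference dates.

    An answer counts as correct only if it matches the fact that was
    valid on the probed date.
    """
    correct = total = 0
    for asof in probe_dates:
        for question, answer, start, end in FACTS:
            if not (start <= asof <= end):
                continue  # this fact is not the gold answer on `asof`
            total += 1
            prediction = query_model(timeshift_prompt(question, asof))
            correct += answer.lower() in prediction.lower()
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Stub model that always gives the 2021-onward answer, illustrating
    # how time-insensitive recall is penalized on earlier probe dates.
    stub = lambda prompt: "Joe Biden"
    print(evaluate(stub, [date(2019, 6, 1), date(2022, 6, 1)]))  # -> 0.5
```

The stub model scores 0.5 because it ignores the reference date; a time-aware model would adapt its answer to the probed date and score 1.0 on this pair.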
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: event extraction, robustness, probing, pre-training, fine-tuning, benchmarking, corpus creation, evaluation methodologies, lexical semantic change, paraphrasing
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 2133