Evaluating LLMs at Temporal Generalization and Bias

ACL ARR 2024 April Submission903 Authors

16 Apr 2024 (modified: 19 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evaluation methodologies that evolve alongside improvements in language comprehension and information processing. Traditional benchmarks, which are often static, fail to capture the continually changing information landscape, leading to a disparity between the perceived and actual effectiveness of LLMs in ever-changing real-world scenarios. Furthermore, these benchmarks do not adequately measure models' capabilities over a broader temporal range or their adaptability over time. We examine current LLMs in terms of temporal generalization and bias, revealing that various temporal biases emerge in both language likelihood and prognostic prediction. This serves as a caution for LLM practitioners to pay closer attention to mitigating temporal biases. We also propose an evaluation framework, \model, for dynamically generating benchmarks from the most recent real-world prognostic-prediction tasks. Our code and dataset will be released soon.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: temporal generalization, evaluation method, large language model
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 903