Time to Revisit Exact Match

ACL ARR 2025 May Submission 6838 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Temporal question answering (QA) is an established method for assessing temporal reasoning in large language models (LLMs). Expected answers are often numerical (e.g., dates or durations), yet model responses are evaluated like regular text with exact match (EM), which cannot distinguish small errors from large ones. In this investigative work, we frame temporal QA as a numerical estimation task to expose the shortcomings of EM. We introduce *TempAnswerQA*, a benchmark distilled from *Test of Time* and *TempTabQA* in which every question requires a numerical temporal answer, allowing us to evaluate models beyond EM. We use the forecasting metrics symmetric mean absolute percentage error (sMAPE) and mean absolute scaled error (MASE). With sMAPE, we find that error size and EM are decoupled: models with low EM can still have low sMAPE (both ~20\%), and some models have high sMAPE despite high EM. Scaling errors by the deviation of the ground-truth data with MASE reshuffles model rankings relative to EM, revealing gaps in models' temporal domain knowledge, especially when they are trained on synthetic data. Lastly, the models' most frequent error is to deviate by only $\pm1$ from the ground truth; sMAPE and MASE, unlike EM, weight these errors adequately. Our findings underscore the need for specialised metrics for temporal QA tasks.
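To make the two forecasting metrics concrete, below is a minimal Python sketch of sMAPE and a MASE variant applied to numerical QA answers. It is not the authors' evaluation code: the sMAPE definition is the common symmetric form, and the scaling term in `mase` (mean absolute deviation of the ground truth from its mean) is an assumption, since the abstract does not spell out the exact denominator used.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    Uses the common definition with the mean of |truth| and |prediction|
    in the denominator; pairs where both values are zero contribute 0.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_pred - y_true)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Guard against 0/0 when both truth and prediction are zero.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100.0 * ratio.mean()

def mase(y_true, y_pred):
    """Mean absolute scaled error.

    QA answers have no time ordering, so this sketch scales the mean
    absolute error by the mean absolute deviation of the ground truth
    from its mean -- an assumed, non-seasonal scaling term.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scale = np.abs(y_true - y_true.mean()).mean()
    return np.abs(y_pred - y_true).mean() / scale

# Illustrative values only: answers off by +/-1 are penalised lightly
# by sMAPE and MASE, whereas exact match scores them all as wrong (EM = 0).
truth = [1990, 7, 30]
pred = [1991, 6, 31]
print(f"sMAPE: {smape(truth, pred):.2f}%")
print(f"MASE:  {mase(truth, pred):.3f}")
```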
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: logical reasoning, reasoning
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 6838