LTLBench: Towards Benchmarks for Evaluating Temporal Reasoning in Large Language Models

07 Apr 2026 (modified: 20 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Temporal Reasoning (TR) is a critical ability that allows LLMs to understand and reason over temporal information and the relationships between events. Prior works have proposed various ways to evaluate different aspects of the TR ability of LLMs. In this work, we propose an alternative perspective for evaluating TR ability by leveraging Linear Temporal Logic (LTL), and develop a pipeline that automatically synthesizes challenges for assessing the TR ability of LLMs. Based on this pipeline, we construct a dataset, LTLBench, consisting of $2000$ TR challenges, and benchmark 12 LLMs across 5 different prompting methods. Furthermore, we conduct additional experiments to investigate how increasing the number of formula operators and events affects both LLM performance and the complexity of TR problems. Qualitative analyses of the models' reasoning processes and of the effects of varying the number of events and formula operators reveal 3 main issues in their temporal reasoning and explain the unexpected performance changes observed as problem complexity increases. We expect this work to provide valuable insights into the TR ability of LLMs.
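The abstract describes a pipeline that synthesizes TR challenges from LTL formulas, with problem complexity controlled by the number of events and formula operators. As an illustration only (the function name, operator set, and sampling scheme below are hypothetical and not the paper's actual implementation), a minimal random-formula generator along these lines might look like:

```python
import random

# Hypothetical operator sets for a toy LTL formula generator.
UNARY = ["X", "F", "G", "!"]      # next, eventually, globally, not
BINARY = ["U", "&", "|", "->"]    # until, and, or, implies

def random_ltl(events, n_operators, rng=random):
    """Return a random LTL formula string over the given event atoms,
    using exactly n_operators temporal/logical operators."""
    if n_operators == 0:
        return rng.choice(events)
    # With one operator left (or by a coin flip), spend it on a unary operator.
    if n_operators == 1 or rng.random() < 0.5:
        op = rng.choice(UNARY)
        return f"{op}({random_ltl(events, n_operators - 1, rng)})"
    # Otherwise split the remaining operator budget across a binary operator.
    op = rng.choice(BINARY)
    left = rng.randint(0, n_operators - 1)
    return (f"({random_ltl(events, left, rng)}) {op} "
            f"({random_ltl(events, n_operators - 1 - left, rng)})")
```

Sketches like this make the two complexity knobs studied in the paper explicit: the size of `events` and the value of `n_operators`.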
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=02QRC2Cuu0
Changes Since Last Submission: We have substantially revised the paper to address the concerns raised by the reviewers. The main updates are summarized as follows: 1. **More up-to-date LLMs are evaluated**: Originally we evaluated only 6 LLMs. We now evaluate 12 LLMs, including more recent models such as DeepSeek V3, more GPT-series models, more Qwen3-series models, and others, providing a more comprehensive and up-to-date evaluation of the temporal reasoning (TR) abilities of recent LLMs; 2. **More methods included for evaluation**: Originally we used only direct prompting. We now include five prompting methods, i.e., Direct Prompting, Zero-Shot CoT, Few-Shot CoT, Self-Consistency, and Least-to-Most, offering more insights into the TR abilities of LLMs under different methods; 3. **More qualitative analysis for the main experiments**: Originally we did not conduct extensive qualitative analysis of the models' failures in their temporal reasoning processes. We now provide a detailed qualitative analysis and reveal 3 main issues in their temporal reasoning processes; 4. **More qualitative analysis for the two additional experiments**: Originally, we did not analyze why TR performance oscillates as problem complexity increases with the number of events and operators. We now offer a detailed qualitative analysis to explain the unexpected performance changes observed as problem complexity increases.
Assigned Action Editor: ~Guillaume_Rabusseau1
Submission Number: 8295