t-BEN: A Temporal Logic Guided Approach for Temporal Reasoning Benchmark Generation

ICLR 2026 Conference Submission 13708 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM, Temporal Logic, Temporal Reasoning, Benchmark
Abstract: In logic-based Artificial Intelligence (AI), temporal reasoning typically involves formalizing problems as logical rules and employing symbolic reasoners to derive new conclusions from structured knowledge. However, symbolic reasoners generally cannot process natural language directly and require manually constructed symbolic knowledge bases, which are time-consuming and resource-intensive to create and maintain. Given the widespread adoption of Large Language Models (LLMs) and their successes across diverse domains, we explore to what extent LLMs can handle temporal logic tasks without traditional symbolic reasoners. We introduce t-BEN, a benchmark suite that strictly adheres to the semantics of temporal logic. It automatically synthesizes temporal reasoning datasets in both symbolic and natural language forms, enabling the evaluation of LLMs on temporal logic reasoning. t-BEN is highly scalable: it supports generating datasets of varying sizes with rule structures of varying complexity. Furthermore, each question in t-BEN is guaranteed to be unseen by LLMs during pretraining, effectively minimizing the risk of data leakage. Our results, along with a detailed ablation study of seven frontier LLMs, offer insights into the capabilities and limitations of current models in temporal logic reasoning tasks.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13708