t-BEN: A Temporal Logic Guided Approach for Temporal Reasoning Benchmark Generation

10 May 2025 (modified: 30 Oct 2025), Submitted to NeurIPS 2025 Datasets and Benchmarks Track, CC BY 4.0
Keywords: LLM, Temporal Logic, Temporal Reasoning, Benchmark
Abstract: In logic-based Artificial Intelligence (AI), temporal reasoning typically involves formalizing problems as logical rule expressions and employing symbolic reasoners to derive new conclusions from structured knowledge. However, symbolic reasoners generally cannot process natural language directly and require manually constructed symbolic knowledge bases, which can be time-consuming and resource-intensive to create and maintain. Given the widespread adoption of Large Language Models (LLMs) and their successes across diverse domains, we explore to what extent LLMs can handle temporal logic tasks without traditional symbolic reasoners. We introduce $\texttt{t-BEN}$, a benchmark suite that strictly adheres to the semantics of temporal logic. It automatically synthesizes temporal reasoning datasets in both symbolic and natural language forms, enabling the evaluation of LLMs on temporal logic reasoning. $\texttt{t-BEN}$ is highly scalable: it supports generating datasets of varying sizes with rule structures of varying complexity. Furthermore, each question in $\texttt{t-BEN}$ is guaranteed to be unseen by LLMs during pretraining, effectively minimizing the risk of data leakage. Our results, along with a detailed ablation study of seven frontier LLMs, offer insights into the capabilities and limitations of current models in temporal logic reasoning tasks. Our generated datasets are available at https://huggingface.co/datasets/BochengZou/t-BEN.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/BochengZou/t-BEN
Code URL: https://drive.google.com/file/d/17fLk6ev7kOCUGKZWtv-laRJHD5SMjEcX/view?usp=sharing
Supplementary Material: zip
Primary Area: Datasets & Benchmarks illustrating Different Deep learning Scenarios (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 1460