tBen: Benchmarking and Testing the Rule-Based Temporal Logic Reasoning Ability of Large Language Models with DatalogMTL

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Temporal Logic Reasoning, Large Language Models, DatalogMTL
TL;DR: We develop tBen, a new set of synthetic benchmarks for rule-based temporal logic reasoning based on DatalogMTL, and conduct extensive experiments and analysis.
Abstract: Large language models (LLMs) are increasingly adopted for a variety of tasks, including multi-hop question answering, knowledge probing, and symbolic commonsense reasoning. While LLMs have advanced the state of the art in these areas, their ability to explicitly solve rule-based temporal logic reasoning problems, a complex cognitive process that involves understanding, representing, and manipulating temporal information such as events, their durations, and their relationships, remains unexplored. To better understand LLM performance on this task, which has long been studied in the traditional symbolic AI field, we develop tBen, a new suite of synthetic benchmarks for rule-based temporal logic reasoning. The tBen benchmarks are built in the context of DatalogMTL, a powerful knowledge representation language for reasoning about the properties of systems that evolve over time, and they provide flexible configurations for customizing temporal rules and task complexity. We evaluated the closed-source GPT-4o and the open-source Llama-3 under three common prompting settings ($\textit{zero-shot}$, $\textit{few-shot}$, and $\textit{zero-shot-CoT}$) on our synthetic benchmarks. Our key findings are as follows: (i) without generating an explicit reasoning process (chain-of-thought), even advanced LLMs such as GPT-4o performed at nearly random levels on these rule-based temporal logic reasoning tasks, whereas with chain-of-thought prompting they demonstrated preliminary temporal logic reasoning abilities; (ii) neither GPT-4o nor Llama-3 could solve temporal logic reasoning problems involving recursion, indicating that they lack the advanced reasoning capabilities needed to handle symbolic representations involving time; (iii) there remains significant room for improvement in leveraging LLMs to address problems long studied in the traditional logic-based AI domain. Prompts and datasets are available in the appendix, and a datasheet for tBen is also provided.
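To make the setting concrete, here is a minimal sketch of a DatalogMTL rule, assuming standard DatalogMTL notation; the predicates and intervals are hypothetical illustrations, not items drawn from tBen. The metric operators $\boxminus_{[a,b]}$ and $\boxplus_{[a,b]}$ read as "at every point in the past/future interval $[a,b]$":

$\boxplus_{[0,1]}\,\textit{Shutdown}(x) \leftarrow \boxminus_{[0,5]}\,\textit{Overheating}(x)$

The rule states that if $x$ has been overheating at every point during the past 5 time units, then $x$ is shut down throughout the next 1 time unit. Recursion, which the findings above single out as a failure case for LLMs, arises when a rule's head predicate also appears, directly or via other rules, in a rule body.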
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9084