LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Logical Reasoning, Large Language Models, Prompting
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present LogicBench, a systematically created natural language question-answering dataset for logical reasoning, encompassing 25 reasoning patterns spanning three types of logic: propositional, first-order, and non-monotonic.
Abstract: Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But can they really "reason" over natural language? This question has been receiving significant research attention, and several reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied. However, the crucial skill of 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability has focused on only a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. To enable systematic evaluation of logical reasoning, we introduce LogicBench, a natural language question-answering dataset encompassing 25 different reasoning patterns spanning propositional, first-order, and non-monotonic logics. The key steps of our dataset construction are (1) controlled generation of sentences and their negations containing different ontologies, (2) creation of (context, question, answer) triplets using heuristically designed templates, and (3) semantic variation of the triplets to add more diversity. We present a comprehensive evaluation of a range of LLMs, including GPT-4, GPT-3, ChatGPT, and FLAN-T5, using chain-of-thought prompting in both zero-shot and few-shot settings. Experimental results show that existing LLMs do not fare well on LogicBench; in particular, they struggle with instances requiring complex reasoning steps. Furthermore, we show that LLMs trained on our data exhibit a better understanding of logical reasoning, leading to performance improvements on several existing logical reasoning datasets such as LogicNLI, FOLIO, LogiQA, and ReClor.
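To make the evaluation setup concrete, below is a minimal Python sketch of how a (context, question, answer) triplet of the kind described in the abstract could be probed with zero-shot chain-of-thought prompting. The example instance, the build_zero_shot_cot_prompt helper, and the commented-out query_llm call are illustrative assumptions, not the paper's actual data format or code.

# Minimal sketch (assumptions, not the paper's code) of zero-shot
# chain-of-thought prompting on a LogicBench-style triplet.

def build_zero_shot_cot_prompt(context: str, question: str) -> str:
    """Compose a zero-shot chain-of-thought prompt for a yes/no logic question."""
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer with 'yes' or 'no'. Let's think step by step."
    )

# Hypothetical modus tollens instance in (context, question, answer) form.
instance = {
    "context": "If it rained last night, then the grass is wet. The grass is not wet.",
    "question": "Did it rain last night?",
    "answer": "no",
}

prompt = build_zero_shot_cot_prompt(instance["context"], instance["question"])
print(prompt)
# response = query_llm(prompt)  # query_llm is a placeholder for any LLM API call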
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6607