Keywords: Logical Reasoning, Large Language Models, Prompting
TL;DR: To study logical reasoning, we present LogicBench, a systematically created natural language question-answering dataset encompassing 25 reasoning patterns spanning propositional, first-order, and non-monotonic logics.
Abstract: Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But can they really "reason" over natural language? This question has been receiving significant research attention, and a number of reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied. However, the crucial skill of 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. To study logical reasoning, we introduce LogicBench, a systematically created natural language question-answering dataset encompassing 25 reasoning patterns spanning propositional, first-order, and non-monotonic logics. The key steps of our dataset construction are (1) controlled generation of sentences and their negations containing different ontologies, (2) creation of (context, question, answer) triplets using heuristically designed templates, and (3) semantic variations of the triplets to add more diversity. We first evaluate easily accessible and widely used LLMs such as GPT-3, ChatGPT, and FLAN-T5 and show that they do not fare well on LogicBench, achieving only slightly above random accuracy on average (~52%). Then, we show that LLMs trained using our data exhibit a better understanding of logical reasoning, leading to performance improvements on several existing logical reasoning datasets such as LogicNLI, FOLIO, LogiQA, and ReClor.
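As a rough illustration of construction step (2), the minimal sketch below shows how a (context, question, answer) triplet might be instantiated from a hand-written template for a single reasoning pattern (modus ponens). The function name, template wording, and yes/no answer format are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' code): building a
# (context, question, answer) triplet from a heuristic template
# for one reasoning pattern (modus ponens).

def modus_ponens_triplet(p: str, q: str) -> dict:
    """If 'p implies q' holds and p holds, then q follows."""
    context = f"If {p}, then {q}. We know that {p}."
    question = f"Does this imply that {q}?"
    return {"context": context, "question": question, "answer": "yes"}

if __name__ == "__main__":
    # Example instantiation with a hypothetical sentence pair.
    triplet = modus_ponens_triplet("it is raining", "the ground is wet")
    print(triplet)
```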
Supplementary Material: pdf
Submission Number: 662