Keywords: Logical Reasoning, Large Language Models, Prompting
TL;DR: To study logical reasoning, we present LogicBench, a systematically created natural language question-answering dataset encompassing 25 reasoning patterns spanning propositional, first-order, and non-monotonic logics.
Abstract: Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But can they really "reason" over natural language? This question has been receiving significant research attention, and a number of reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied. However, the crucial skill of 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. To study logical reasoning, we introduce LogicBench, a systematically created natural language question-answering dataset encompassing 25 reasoning patterns spanning propositional, first-order, and non-monotonic logics. The key steps of our dataset construction are (1) controlled generation of sentences and their negations containing different ontologies, (2) creation of (context, question, answer) triplets using heuristically designed templates, and (3) semantic variations of the triplets to add more diversity. We first evaluate easily accessible and widely used LLMs such as GPT-3, ChatGPT, and FLAN-T5 and show that they do not fare well on LogicBench, achieving only slightly above random accuracy on average (~52%). Then, we show that LLMs trained using our data exhibit a better understanding of logical reasoning, leading to performance improvements on several existing logical reasoning datasets such as LogicNLI, FOLIO, LogiQA, and ReClor.
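As a rough illustration of construction step (2), the minimal sketch below shows how a (context, question, answer) triplet might be instantiated from a hand-written template for a single reasoning pattern (modus ponens). The function name, template wording, and yes/no answer format are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' code): building a
# (context, question, answer) triplet from a heuristic template
# for one reasoning pattern (modus ponens).

def modus_ponens_triplet(p: str, q: str) -> dict:
    """If 'p implies q' holds and p holds, then q follows."""
    context = f"If {p}, then {q}. We know that {p}."
    question = f"Does this imply that {q}?"
    return {"context": context, "question": question, "answer": "yes"}

if __name__ == "__main__":
    # Example instantiation with a hypothetical sentence pair.
    triplet = modus_ponens_triplet("it is raining", "the ground is wet")
    print(triplet)
```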
Supplementary Material: pdf
Submission Number: 662