A ∧ B ⇔ B ∧ A: Evaluating and Improving Logical Reasoning Ability of Large Language Models

ACL ARR 2024 April Submission 808 Authors

16 Apr 2024 (modified: 22 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) such as ChatGPT and GPT-4. Despite their strength in tasks like writing assistance, code generation, and machine translation, assessing LLMs' ability to reason has remained challenging: traditional evaluations often prioritize accuracy on downstream tasks over direct assessment of the reasoning process itself. LogicAsker addresses this gap by employing a set of atomic reasoning skills grounded in propositional and predicate logic to systematically examine and improve the reasoning of LLMs. Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failure rates ranging from 25% to 94% across different models. Moreover, we leverage these findings to construct targeted demonstration examples for in-context learning, improving logical reasoning in models such as GPT-4 by up to 10%. To our knowledge, this is the first effort to use test case outcomes to effectively refine LLMs' formal reasoning capabilities. We will make our code, data, and results publicly available to facilitate further research and replication of our findings.
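To illustrate the idea of an atomic test case built from a single logical rule (such as the conjunction commutativity in the title, A ∧ B ⇔ B ∧ A), here is a minimal hypothetical sketch. The function names, prompt format, and rule encoding are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: instantiating one atomic reasoning skill
# (commutation of conjunction, A ∧ B ⇔ B ∧ A) as a natural-language
# test case. Names and prompt format are assumptions for illustration.

RULE = ("commutation", "A and B", "B and A")

def make_test_case(fact_a: str, fact_b: str) -> dict:
    """Fill the commutation rule template with concrete atomic facts."""
    premise = f"{fact_a} and {fact_b}."
    question = f"Does it follow that {fact_b} and {fact_a}? Answer yes or no."
    return {"rule": RULE[0], "prompt": f"{premise}\n{question}", "expected": "yes"}

case = make_test_case("it is raining", "the ground is wet")
print(case["prompt"])
# A model answering "no" would be recorded as a reasoning failure for
# this rule; aggregating such failures per rule yields the kind of
# per-skill weakness profile the abstract describes.
```

Failing cases of this form could then serve directly as targeted demonstration examples for in-context learning, as the abstract proposes.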
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Large Language Models, Logical Reasoning, Minimum Functionality Test
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 808