Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: We propose Multi-LogiEval, a comprehensive evaluation benchmark for multi-step logical reasoning, covering three logic types and more than 50 inference rule combinations.
Abstract: As Large Language Models (LLMs) continue to exhibit remarkable performance on natural language understanding tasks, there is a crucial need to measure their ability to perform human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of benchmarks for evaluating non-monotonic reasoning represents a crucial gap, since non-monotonic reasoning aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation benchmark encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types (propositional, first-order, and non-monotonic), comprising more than 15 inference rules and more than 50 of their combinations. Leveraging this benchmark, we conduct evaluations on a range of LLMs, including GPT-4, ChatGPT, GPT-3, Llama-2, and FLAN-T5, employing zero-shot chain-of-thought prompting. Experimental results show a significant drop in LLM performance as the number of reasoning steps/depth increases (average accuracy of ~43% at depth-1 versus ~22% at depth-5). We further conduct a thorough investigation of the reasoning chains generated by LLMs, which reveals several important findings. We believe that Multi-LogiEval will facilitate future research on evaluating and enhancing the logical reasoning ability of LLMs. Data is available at https://anonymous.4open.science/r/Multi_LogicEval-0545.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
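
To make the evaluation protocol described in the abstract more concrete, below is a minimal, hypothetical Python sketch of zero-shot chain-of-thought evaluation with per-depth accuracy. The data layout (context/question/answer/depth fields), the file name, and the query_llm helper are assumptions for illustration only, not the authors' released code or data format.

```python
import json
from collections import defaultdict

# Hypothetical helper: wrap whichever LLM API is under evaluation here.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in the model under evaluation")

# Zero-shot chain-of-thought trigger, in the spirit of "Let's think step by step".
COT_SUFFIX = "\nLet's think step by step, then end with 'Answer: yes' or 'Answer: no'."

def evaluate(path: str) -> dict:
    """Compute accuracy per reasoning depth; assumes one JSON object per line
    with 'context', 'question', 'answer' (yes/no), and 'depth' fields."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = f"{ex['context']}\nQuestion: {ex['question']}{COT_SUFFIX}"
            reply = query_llm(prompt).lower()
            # Take the last yes/no mention after the reasoning chain as the prediction.
            pred = "yes" if reply.rfind("yes") > reply.rfind("no") else "no"
            total[ex["depth"]] += 1
            correct[ex["depth"]] += int(pred == ex["answer"].lower())
    return {d: correct[d] / total[d] for d in sorted(total)}

if __name__ == "__main__":
    # Hypothetical file name; depths 1-5 would correspond to the reported accuracy drop.
    print(evaluate("multi_logieval_depth5.jsonl"))
```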