Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: We propose Multi-LogiEval, a comprehensive evaluation benchmark for multi-step logical reasoning, covering three logic types and more than 50 inference rule combinations.
Abstract: As Large Language Models (LLMs) continue to exhibit remarkable performance on natural language understanding tasks, there is a crucial need to measure their ability to perform human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of benchmarks for evaluating non-monotonic reasoning represents a crucial gap, since non-monotonic reasoning aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation benchmark encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types (propositional, first-order, and non-monotonic), comprising more than 15 inference rules and more than 50 of their combinations. Leveraging this benchmark, we conduct evaluations on a range of LLMs, including GPT-4, ChatGPT, GPT-3, Llama-2, and FLAN-T5, employing zero-shot chain-of-thought prompting. Experimental results show a significant drop in LLM performance as the number of reasoning steps/depth increases (average accuracy of ~43% at depth-1 versus ~22% at depth-5). We further conduct a thorough investigation of the reasoning chains generated by LLMs, which reveals several important findings. We believe that Multi-LogiEval will facilitate future research on evaluating and enhancing the logical reasoning ability of LLMs. Data is available at https://anonymous.4open.science/r/Multi_LogicEval-0545.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
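
To make the evaluation protocol described in the abstract more concrete, below is a minimal, hypothetical Python sketch of zero-shot chain-of-thought evaluation with per-depth accuracy. The data layout (context/question/answer/depth fields), the file name, and the query_llm helper are assumptions for illustration only, not the authors' released code or data format.

```python
import json
from collections import defaultdict

# Hypothetical helper: wrap whichever LLM API is under evaluation here.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in the model under evaluation")

# Zero-shot chain-of-thought trigger, in the spirit of "Let's think step by step".
COT_SUFFIX = "\nLet's think step by step, then end with 'Answer: yes' or 'Answer: no'."

def evaluate(path: str) -> dict:
    """Compute accuracy per reasoning depth; assumes one JSON object per line
    with 'context', 'question', 'answer' (yes/no), and 'depth' fields."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = f"{ex['context']}\nQuestion: {ex['question']}{COT_SUFFIX}"
            reply = query_llm(prompt).lower()
            # Take the last yes/no mention after the reasoning chain as the prediction.
            pred = "yes" if reply.rfind("yes") > reply.rfind("no") else "no"
            total[ex["depth"]] += 1
            correct[ex["depth"]] += int(pred == ex["answer"].lower())
    return {d: correct[d] / total[d] for d in sorted(total)}

if __name__ == "__main__":
    # Hypothetical file name; depths 1-5 would correspond to the reported accuracy drop.
    print(evaluate("multi_logieval_depth5.jsonl"))
```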