Natural Language Reasoning in Large Language Models: Analysis and Evaluation

ACL ARR 2024 December Submission786 Authors

15 Dec 2024 (modified: 05 Feb 2025), ACL ARR 2024 December Submission, CC BY 4.0
Abstract: Despite the growing focus on reasoning in Large Language Models (LLMs), particularly through techniques such as Chain-of-Thought prompting, there remains limited analysis of whether these models are genuinely reasoning or whether performance improvements stem mainly from the additional context added to the prompt. Moreover, there is a lack of advanced evaluation tasks for assessing natural language reasoning in generative models. This paper addresses this gap by presenting the first large-scale evaluation of the unconstrained natural language reasoning capabilities of LLMs, grounded in natural language argumentation. It makes three main contributions: (i) the formalisation of a new strategy for evaluating argumentative reasoning understanding in LLMs, argument-component selection; (ii) the creation of the Argument Reasoning Tasks (ART) dataset, a new benchmark based on argument structures for natural language reasoning; and (iii) an extensive experimental analysis of four different models, which consistently points to important limitations of LLMs on natural language reasoning tasks.
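To make the argument-component selection setup more concrete, the sketch below shows one possible way such an evaluation could be framed: the model is given an argument component (e.g. a claim) and several candidate components, and must select the one that fits the argument structure. The item fields, prompt wording, and scoring loop here are illustrative assumptions only; they are not taken from the ART dataset or the paper's protocol.

    # Hypothetical illustration of a selection-style evaluation item.
    # Field names, prompt wording, and scoring are assumptions for exposition,
    # not the format used by the ART dataset or the paper.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class SelectionItem:
        claim: str                # the argument component the model reasons about
        candidates: List[str]     # candidate components (e.g. premises)
        gold_index: int           # index of the correct candidate

    def build_prompt(item: SelectionItem) -> str:
        options = "\n".join(f"({i}) {c}" for i, c in enumerate(item.candidates))
        return (
            "Claim: " + item.claim + "\n"
            "Which of the following premises best supports the claim?\n"
            + options + "\nAnswer with the option number only."
        )

    def accuracy(items: List[SelectionItem], ask_model: Callable[[str], str]) -> float:
        """Score any text-in/text-out model callable on selection items."""
        correct = 0
        for item in items:
            reply = ask_model(build_prompt(item)).strip()
            digits = "".join(ch for ch in reply if ch.isdigit())
            if digits and int(digits) == item.gold_index:
                correct += 1
        return correct / len(items) if items else 0.0

Under this framing, any generative model can be evaluated by passing its text-generation function as ask_model and reporting selection accuracy over the benchmark items.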
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Argument Mining, Reasoning, Large Language Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 786