Abstract: The evaluation of reasoning capabilities is crucial for the advancement of Artificial General Intelligence. While Large Language Models (LLMs) demonstrate proficiency in reasoning tasks, existing benchmarks such as GSM8K and LogiQA are limited, focusing mainly on individual problem solving with linear logic and static conditions. To bridge this gap, we introduce an automated data construction pipeline that simulates real-world reasoning scenarios by combining existing reasoning problems into more complex, long-chain reasoning problems. Based on this pipeline, we construct LREval, a new benchmark designed to assess comprehensive reasoning skills, including multi-step logical deduction, integration of diverse information sources, and dynamic decision-making. Our evaluations reveal substantial reasoning challenges for current LLMs. Closed-source models perform well in dynamic contexts but struggle to integrate information from multiple sources, while open-source models exhibit the opposite trend. Moreover, model performance is highly sensitive to perturbations in task conditions, revealing the fragility of reasoning capabilities in current LLMs and the necessity for robust evaluation frameworks. Additionally, models struggle with tasks requiring simultaneous comprehension of multiple languages, further underscoring their limitations in multilingual understanding.
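A minimal, hypothetical sketch of the chaining idea described in the abstract: existing reasoning problems are composed so that the gold answer of one sub-problem becomes a stated condition of the next, producing a single long-chain problem. The names (`Problem`, `chain_problems`) and the templating scheme are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' implementation): compose existing
# reasoning problems into one long-chain problem by feeding each gold answer
# into the next problem's condition slot.
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    statement: str  # may contain a "{prev}" slot for the injected condition
    answer: str     # gold answer of this sub-problem


def chain_problems(problems: List[Problem]) -> str:
    """Compose sub-problems into a single composite long-chain problem."""
    steps: List[str] = []
    prev = ""
    for i, p in enumerate(problems, start=1):
        steps.append(f"Step {i}: {p.statement.format(prev=prev)}")
        prev = p.answer  # this answer becomes the next step's condition
    steps.append("Report the answer obtained in the final step.")
    return "\n".join(steps)


if __name__ == "__main__":
    print(chain_problems([
        Problem("Alice has 5 marbles and buys 3 more. How many does she have?", "8"),
        Problem("Alice ends up with {prev} marbles. Bob has twice as many as Alice. "
                "How many does Bob have?", "16"),
    ]))
```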
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, multilingual benchmarks
Contribution Types: Data resources
Languages Studied: English, Chinese
Submission Number: 3252