Abstract: The evaluation of reasoning capabilities is crucial for the advancement of Artificial General Intelligence. While Large Language Models (LLMs) demonstrate proficiency in reasoning tasks, existing benchmarks such as GSM8K and LogiQA are limited, focusing mainly on individual problem solving with linear logic and static conditions. To bridge this gap, we introduce an automated data construction pipeline that simulates real-world reasoning scenarios by combining existing reasoning problems into more complex, long-chain reasoning problems. Based on this pipeline, we construct LREval, a new benchmark designed to assess comprehensive reasoning skills, including multi-step logical deduction, integration of diverse information sources, and dynamic decision-making. Our evaluations reveal substantial reasoning challenges for current LLMs. Closed-source models perform well in dynamic contexts but struggle to integrate information from multiple sources, while open-source models exhibit the opposite trend. Moreover, model performance is highly sensitive to perturbations in task conditions, revealing the fragility of reasoning capabilities in current LLMs and the necessity for robust evaluation frameworks. Additionally, models struggle with tasks requiring simultaneous comprehension of multiple languages, further underscoring their limitations in multilingual understanding.
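A minimal, hypothetical sketch of the chaining idea described in the abstract: existing reasoning problems are composed so that the gold answer of one sub-problem becomes a stated condition of the next, producing a single long-chain problem. The names (`Problem`, `chain_problems`) and the templating scheme are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' implementation): compose existing
# reasoning problems into one long-chain problem by feeding each gold answer
# into the next problem's condition slot.
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    statement: str  # may contain a "{prev}" slot for the injected condition
    answer: str     # gold answer of this sub-problem


def chain_problems(problems: List[Problem]) -> str:
    """Compose sub-problems into a single composite long-chain problem."""
    steps: List[str] = []
    prev = ""
    for i, p in enumerate(problems, start=1):
        steps.append(f"Step {i}: {p.statement.format(prev=prev)}")
        prev = p.answer  # this answer becomes the next step's condition
    steps.append("Report the answer obtained in the final step.")
    return "\n".join(steps)


if __name__ == "__main__":
    print(chain_problems([
        Problem("Alice has 5 marbles and buys 3 more. How many does she have?", "8"),
        Problem("Alice ends up with {prev} marbles. Bob has twice as many as Alice. "
                "How many does Bob have?", "16"),
    ]))
```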
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, multilingual benchmarks
Contribution Types: Data resources
Languages Studied: English, Chinese
Submission Number: 3252