Res-Bench: A Reasoning-Skill-Aware Diagnostic Benchmark for Mathematical Reasoning

03 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM evaluation, Mathematical Reasoning
Abstract: Large Language Models (LLMs) have recently demonstrated strong performance on mathematical reasoning tasks, which are often evaluated solely by whether they produce the correct final answer. However, this evaluation paradigm fails to capture whether models genuinely follow sound reasoning processes or rely on spurious shortcuts. In this paper, we introduce Res-Bench, the first fine-grained evaluation dataset for measuring the mathematical reasoning abilities of LLMs not only by final-answer correctness but also by step-level reasoning quality and reasoning-skill alignment. Res-Bench consists of 3,271 test samples, primarily math problems aligned with Chinese middle- and high-school curricula and provided in English. Each test case is decomposed into intermediate reasoning steps and mapped to explicit reasoning skills, with annotations produced by GPT-4 and verified by human experts. Based on Res-Bench, we conduct an extensive evaluation with a multi-dimensional protocol that measures: (1) final-answer accuracy, (2) consistency and validity of intermediate steps, and (3) mastery of the required reasoning skills. Our experiments across several state-of-the-art LLMs reveal that while models often achieve high answer-level accuracy, their step-level reasoning exhibits significant inconsistencies and frequent misalignment with the targeted reasoning skills. These findings highlight the need to move beyond final-answer evaluation toward process-based assessment, providing deeper insight into LLMs' reasoning capabilities.
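To make the multi-dimensional protocol concrete, below is a minimal sketch of how the three scores described in the abstract could be aggregated. This is not the released evaluation code; the field names (`answer`, `step_valid_flags`, `skills`, `skills_used`) are hypothetical placeholders standing in for the benchmark's actual schema.

```python
# Hypothetical scoring sketch for the three Res-Bench dimensions.
# Assumes each gold sample provides a final answer and a set of annotated
# reasoning skills, and each prediction provides a final answer, per-step
# validity judgments, and the skills its solution exercised.

def evaluate(samples, predictions):
    answer_hits = 0
    step_valid, step_total = 0, 0
    skill_hits, skill_total = 0, 0

    for sample, pred in zip(samples, predictions):
        # (1) Final-answer accuracy: exact match against the gold answer.
        answer_hits += int(pred["answer"] == sample["answer"])

        # (2) Step-level consistency/validity: fraction of predicted steps
        #     judged valid (assumed pre-labelled as booleans).
        step_valid += sum(pred["step_valid_flags"])
        step_total += len(pred["step_valid_flags"])

        # (3) Skill alignment: fraction of annotated reasoning skills that
        #     the model's solution actually exercised.
        gold_skills = set(sample["skills"])
        skill_hits += len(gold_skills & set(pred["skills_used"]))
        skill_total += len(gold_skills)

    return {
        "answer_accuracy": answer_hits / len(samples),
        "step_validity": step_valid / max(step_total, 1),
        "skill_alignment": skill_hits / max(skill_total, 1),
    }
```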
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1240