Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

ACL ARR 2026 January Submission 8082 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Large Language Models, Mathematical Reasoning, Benchmark, Process Evaluation, Structural Reasoning, Process Reward Model, Chain-of-Thought

Abstract: Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce \textsc{ReasoningMath-Plus}, a benchmark of 150 carefully curated problems explicitly designed to evaluate \emph{structural reasoning}. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce \textsc{HCRS} (\textbf{H}azard-aware \textbf{C}hain-based \textbf{R}ule \textbf{S}core), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to $5.8/10$), HCRS-based holistic evaluation yields substantially lower scores (average $4.36/10$, best $5.14/10$), showing that answer-only metrics can overestimate reasoning robustness.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: Benchmarking, Datasets, Evaluation Methodologies

Contribution Types: Data resources, Data analysis

Languages Studied: English, Chinese

Submission Number: 8082
