Keywords: Mathematical Reasoning, Graph-Structured Reasoning, Chain-of-Thought, Reinforcement Learning, Step-level Optimization
Abstract: Despite recent progress, large language models (LLMs) for mathematical reasoning often exhibit fragile behaviors, where correct answers are produced despite invalid or incoherent intermediate reasoning. We identify two recurring structural pathologies in Chain-of-Thought (CoT) reasoning: disconnected steps, where intermediate results are not reused, and weak logical flow, where steps are loosely or incorrectly linked yet still yield correct answers. These failures are difficult to address under outcome-only supervision.
To mitigate these issues, we propose the Graph-structured Stepwise Reasoning Framework (GSRF), which reformulates implicit CoT into a Graph-structured Stepwise CoT (GS-CoT) that makes inter-step dependencies explicit. Building on this structure, we introduce Graph-guided Group Relative Policy Optimization (G-GRPO), which incorporates process-level rewards that encourage step reuse and alignment with the final answer.
Extensive experiments on both textual and multimodal mathematical reasoning benchmarks demonstrate that GSRF achieves competitive performance while producing more faithful, coherent, and structurally grounded reasoning traces.
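The two process-level signals described in the abstract can be made concrete with a small sketch. The following is an illustrative toy example, not the authors' implementation: the `Step` dataclass, the `depends_on` field, and both reward functions are hypothetical names chosen here to show how explicit inter-step dependencies could support a step-reuse reward and a final-answer-alignment reward.

```python
# Illustrative sketch (not the paper's code): a minimal graph-structured
# stepwise CoT where each step records which earlier steps it reuses,
# plus toy process-level rewards for step reuse and answer alignment.
from dataclasses import dataclass, field


@dataclass
class Step:
    idx: int
    text: str
    depends_on: list = field(default_factory=list)  # indices of reused steps


def reuse_reward(steps):
    """Fraction of non-initial steps that reuse at least one earlier result."""
    later = steps[1:]
    if not later:
        return 0.0
    return sum(1 for s in later if s.depends_on) / len(later)


def alignment_reward(steps, final_answer):
    """1.0 if the last step's text states the final answer, else 0.0."""
    return 1.0 if steps and final_answer in steps[-1].text else 0.0


# Toy trace: step 2 reuses steps 0 and 1; the last step states the answer.
trace = [
    Step(0, "a = 3 + 4 = 7"),
    Step(1, "b = 2 * 5 = 10"),
    Step(2, "a + b = 17", depends_on=[0, 1]),
]
print(reuse_reward(trace), alignment_reward(trace, "17"))  # 0.5 1.0
```

A trace whose intermediate results are never reused (all `depends_on` lists empty) would score zero on the reuse reward even if its final answer were correct, which is exactly the "disconnected steps" pathology the abstract identifies.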
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Mathematical Reasoning, Graph-Structured Reasoning, Chain-of-Thought, Reinforcement Learning, Step-level Optimization
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 9613