Abstract: Enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has become a prominent research focus. Existing approaches exhibit notable limitations: inference-time techniques, such as Chain of Thought, depend on prompt engineering and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle to ensure step-wise mathematical correctness and often rely on model distillation or human annotations; Reinforcement Learning (RL) methods entail high GPU memory consumption and training instability. To overcome these challenges, we propose \textbf{S}elf-training with \textbf{P}rocess \textbf{P}reference learning using \textbf{D}ynamic value margin (\textbf{SPPD}). SPPD formulates reasoning as a process-based Markov Decision Process (MDP) and leverages the Bellman optimality equation to derive a \textbf{dynamic value margin} for step-level preference optimization. It further incorporates tree-based self-sampling of model responses, \textbf{eliminating the need for distillation}. We theoretically establish that SPPD is \textbf{equivalent to on-policy policy gradient methods} under constrained reward functions. Experiments on 7B-scale models demonstrate consistent superiority over baseline methods on both in-domain and out-of-domain mathematical benchmarks. Our code is publicly available at \url{https://anonymous.4open.science/r/SSDPO-D-DCDD}.
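For illustration, a minimal sketch of how such a dynamic value margin could enter a DPO-style step-level objective; all symbols here ($\pi_\theta$, $\pi_{\mathrm{ref}}$, $\beta$, $\sigma$, $\gamma$, $V^*$, $s_t$, $a_t^w$, $a_t^l$, $\Delta_t$) are notation assumed for this sketch and need not match the paper's exact formulation:
\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(s_t,\,a_t^w,\,a_t^l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(a_t^w \mid s_t)}{\pi_{\mathrm{ref}}(a_t^w \mid s_t)} - \beta \log \frac{\pi_\theta(a_t^l \mid s_t)}{\pi_{\mathrm{ref}}(a_t^l \mid s_t)} - \Delta_t\Big)\Big],
\qquad
\Delta_t = \gamma\,\big(V^*(s_{t+1}^w) - V^*(s_{t+1}^l)\big),
\]
where $s_t$ is the partial solution after $t$ reasoning steps, $a_t^w$ and $a_t^l$ are preferred and dispreferred candidate next steps drawn from the model's own tree-based samples, and the margin $\Delta_t$ scales with the gap in optimal state values (in the spirit of the Bellman optimality equation), so a step leading to a higher-value state must be preferred by a correspondingly larger log-likelihood-ratio gap.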
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Reinforcement Learning, Process Preference Learning, Self Training
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2341