Abstract: Enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has become a prominent research focus. Existing approaches exhibit notable limitations: inference-time techniques, such as Chain of Thought, depend on prompt engineering and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle to ensure step-wise mathematical correctness and often rely on model distillation or human annotations; Reinforcement Learning (RL) methods entail high GPU memory consumption and training instability. To overcome these challenges, we propose \textbf{S}elf-training with \textbf{P}rocess \textbf{P}reference learning using \textbf{D}ynamic value margin (\textbf{SPPD}). SPPD formulates reasoning as a process-based Markov Decision Process (MDP) and leverages the Bellman optimality equation to derive a \textbf{dynamic value margin} for step-level preference optimization. It further incorporates tree-based self-sampling of model responses, \textbf{eliminating the need for distillation}. We theoretically establish that SPPD is \textbf{equivalent to on-policy policy gradient methods} under constrained reward functions. Experiments on 7B-scale models demonstrate consistent superiority over baseline methods on both in-domain and out-of-domain mathematical benchmarks. Our code is publicly available at \url{https://anonymous.4open.science/r/SSDPO-D-DCDD}.
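For illustration, a minimal sketch of how such a dynamic value margin could enter a DPO-style step-level objective; all symbols here ($\pi_\theta$, $\pi_{\mathrm{ref}}$, $\beta$, $\sigma$, $\gamma$, $V^*$, $s_t$, $a_t^w$, $a_t^l$, $\Delta_t$) are notation assumed for this sketch and need not match the paper's exact formulation:
\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(s_t,\,a_t^w,\,a_t^l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(a_t^w \mid s_t)}{\pi_{\mathrm{ref}}(a_t^w \mid s_t)} - \beta \log \frac{\pi_\theta(a_t^l \mid s_t)}{\pi_{\mathrm{ref}}(a_t^l \mid s_t)} - \Delta_t\Big)\Big],
\qquad
\Delta_t = \gamma\,\big(V^*(s_{t+1}^w) - V^*(s_{t+1}^l)\big),
\]
where $s_t$ is the partial solution after $t$ reasoning steps, $a_t^w$ and $a_t^l$ are preferred and dispreferred candidate next steps drawn from the model's own tree-based samples, and the margin $\Delta_t$ scales with the gap in optimal state values (in the spirit of the Bellman optimality equation), so a step leading to a higher-value state must be preferred by a correspondingly larger log-likelihood-ratio gap.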
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Reinforcement Learning, Process Preference Learning, Self Training
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2341