ReST-RL: Reinforcing LLM Reasoning through Self-Training and Value-Guided Decoding

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Reinforcement Learning, Code Reasoning, Self Training
TL;DR: We propose a novel RL framework that significantly improves LLMs' code reasoning capabilities through optimized self-training and value-guided decoding.
Abstract: Group Relative Policy Optimization (GRPO), a representative reinforcement learning (RL) method for improving the reasoning accuracy of LLMs, has achieved considerable success, yet it still suffers from insignificant reward variance: when the rewards within a sampled group are nearly identical, the group-relative advantages collapse and the training signal vanishes. This paper introduces ReST-RL, a unified LLM RL paradigm that combines an improved GRPO algorithm with a carefully designed test-time decoding method to improve LLMs' code reasoning ability. In the first stage, policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to increase the reward variance of GRPO sampling, thereby improving training effectiveness. Building on this foundation, we further introduce a test-time decoding optimization method, VM-MCTS, which employs an adapted Monte-Carlo Tree Search (MCTS) guided by a trained Value Model (VM) to provide precise process signals and verification scores, further enhancing LLM reasoning accuracy. We validate our RL paradigm on multiple coding benchmarks (e.g., APPS, BigCodeBench, and HumanEval), where it significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO) as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS), indicating its power to strengthen LLMs' reasoning capability. We further examine ReST-RL on out-of-domain math reasoning tasks, demonstrating that ReST-RL and the VM transfer and generalize well across unseen reasoning domains and policy checkpoints, confirming that the approach extends beyond coding. Notably, it achieves strong performance with limited data, showcasing its effectiveness, efficiency, and generalizability.
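To make the reward-variance issue concrete, below is a minimal sketch of GRPO's group-relative advantage normalization; the function name and tolerance are illustrative, not from the paper. When all completions in a sampled group receive nearly identical rewards, the normalized advantages collapse toward zero and the policy update carries almost no signal, which is the degenerate case ReST-GRPO's optimized sampling aims to avoid.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each reward is
    normalized by the sampled group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Near-identical rewards -> advantages ~0 -> negligible gradient signal:
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # ~[0, 0, 0, 0]
# A higher-variance group yields informative advantages:
print(group_relative_advantages([0.0, 0.2, 0.8, 1.0]))
```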
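The abstract does not spell out VM-MCTS, so the following is only a hedged sketch of value-guided tree search under common MCTS conventions: a standard UCT selection rule in which the exploitation term comes from a trained value model scoring partial solutions, in place of full Monte-Carlo rollouts. All names here (Node, select_child, backup, c_uct) are assumptions for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                      # partial solution, e.g., a code prefix
    visits: int = 0
    value_sum: float = 0.0          # accumulated VM scores backed up here
    children: list = field(default_factory=list)

def select_child(parent, c_uct=1.0):
    """UCT selection: balance the value model's estimate (exploitation)
    against visit counts (exploration)."""
    def uct(child):
        q = child.value_sum / child.visits if child.visits else float("inf")
        u = c_uct * math.sqrt(math.log(parent.visits + 1) / (child.visits + 1))
        return q + u
    return max(parent.children, key=uct)

def backup(path, vm_score):
    """Propagate the value model's score for a new leaf up the selected
    path, standing in for a rollout return."""
    for node in path:
        node.visits += 1
        node.value_sum += vm_score
```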
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 4645