Keywords: RLVR, Reasoning with LLMs
TL;DR: We propose TreeRPO, a GRPO variant for training LLMs with dense process rewards without using PRMs.
Abstract: Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce TreeRPO, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO estimates these rewards directly through the sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO computes rewards based on step-level groups generated during tree sampling. This design allows TreeRPO to produce fine-grained, dense reward signals, significantly enhancing the learning process and overall performance of LLMs. Experimental results demonstrate that TreeRPO substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, increasing it from 19.0% to 35.5%. Furthermore, TreeRPO outperforms GRPO by 2.9% while reducing the average response length by 18.1%, showcasing its effectiveness and efficiency.
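To make the step-level reward construction described in the abstract concrete, here is a minimal Python sketch of one plausible reading of the procedure: each step's expected reward is estimated as the mean verifiable reward over its sampled continuations in the tree, and advantages are then normalized within sibling (step-level) groups in a GRPO-style fashion. All names (TreeNode, estimate_values, assign_step_advantages) and implementation details are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of step-level group-relative advantage estimation via tree sampling.
# This is an assumed reading of the abstract, not the authors' implementation.
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class TreeNode:
    """A node in the sampling tree: one reasoning step with sampled continuations."""
    children: list["TreeNode"] = field(default_factory=list)
    terminal_reward: float = 0.0  # verifiable outcome reward at a leaf (e.g. 1.0 if the answer is correct)
    value: float = 0.0            # estimated expected reward of this step
    advantage: float = 0.0        # group-relative advantage w.r.t. sibling steps

def estimate_values(node: TreeNode) -> float:
    """Estimate each step's expected reward as the mean over its sampled continuations."""
    if not node.children:  # leaf: use the verifiable outcome reward directly
        node.value = node.terminal_reward
    else:
        node.value = mean(estimate_values(c) for c in node.children)
    return node.value

def assign_step_advantages(node: TreeNode) -> None:
    """Normalize sibling values within each step-level group, GRPO-style."""
    if not node.children:
        return
    vals = [c.value for c in node.children]
    mu, sigma = mean(vals), pstdev(vals)
    for c in node.children:
        c.advantage = (c.value - mu) / (sigma + 1e-8)
        assign_step_advantages(c)
```

Under this reading, the dense signal comes from the fact that every intermediate node receives its own advantage from its sibling group, rather than inheriting a single trajectory-level reward.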
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9473