Keywords: llm, reasoning, reinforcement learning, advantage estimation
TL;DR: Tree-OPO goes beyond standard GRPO variants, introducing a principled off-policy framework where Staged Advantage Estimation (SAE) enforces tree consistency and improves the quality of policy gradients.
Abstract: Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories—traditionally used for training value or reward models—can be repurposed to improve policy optimization in preference-based reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables preference-consistent policy learning without value networks. We reframe GRPO into a staged training paradigm, leveraging a teacher's MCTS rollouts to construct a tree-structured curriculum of prefixes. This introduces the novel challenge of computing advantages for training samples that originate from different prefixes, each with a distinct expected return. To address this, we propose Staged Advantage Estimation (SAE), a framework for computing low-variance, prefix-aware advantages by projecting rewards onto a constraint set that respects the tree's hierarchy. Our empirical results show that SAE improves final accuracy over standard GRPO on mathematical reasoning tasks. This finding is supported by our theoretical analysis—which proves SAE reduces gradient variance for improved sample efficiency—and is demonstrated using both efficient heuristics and a formal quadratic program.
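A minimal sketch of the prefix-aware advantage computation the abstract describes, under stated assumptions: the rollout records, prefix tree, rewards, and the simple consistency rule below are hypothetical placeholders for illustration, not the paper's actual data or constraint set (which uses heuristics or a quadratic program).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical MCTS rollout records: (prefix_id, parent_prefix_id, final_reward).
# The tree, ids, and rewards are illustrative placeholders only.
rollouts = [
    ("root",   None,   0.0), ("root",   None,   1.0),
    ("root/a", "root", 1.0), ("root/a", "root", 1.0),
    ("root/b", "root", 0.0), ("root/b", "root", 1.0),
]

# 1) Estimate each prefix's expected return as the mean reward of its completions.
returns = defaultdict(list)
for prefix, _, reward in rollouts:
    returns[prefix].append(reward)
value = {p: mean(rs) for p, rs in returns.items()}

# 2) Apply a simple tree-consistency rule as a stand-in for the constrained
#    projection: here, a parent's value may not exceed the best value among its
#    children. A deeper tree would require a bottom-up pass.
children = defaultdict(set)
for prefix, parent, _ in rollouts:
    if parent is not None:
        children[parent].add(prefix)
for parent, kids in children.items():
    value[parent] = min(value[parent], max(value[k] for k in kids))

# 3) Prefix-aware advantage: each sample's reward minus the (consistent) value of
#    the prefix it was generated from, rather than a single group-wide baseline.
advantages = [reward - value[prefix] for prefix, _, reward in rollouts]
print(advantages)  # -> [-0.5, 0.5, 0.0, 0.0, -0.5, 0.5]
```

Samples drawn from stronger prefixes are measured against a higher baseline, so their advantages stay centered within each stage of the tree-structured curriculum rather than being inflated or deflated by the prefix they happened to start from.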
Submission Number: 35