Keywords: llm, reasoning, reinforcement learning, advantage estimation
TL;DR: Tree-OPO goes beyond standard GRPO variants, introducing a principled off-policy framework where Staged Advantage Estimation (SAE) enforces tree consistency and improves the quality of policy gradients.
Abstract: Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories—traditionally used for training value or reward models—can be repurposed to improve policy optimization in preference-based reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables preference-consistent policy learning without value networks. We reframe GRPO into a staged training paradigm, leveraging a teacher's MCTS rollouts to construct a tree-structured curriculum of prefixes. This introduces the novel challenge of computing advantages for training samples that originate from different prefixes, each with a distinct expected return. To address this, we propose Staged Advantage Estimation (SAE), a framework for computing low-variance, prefix-aware advantages by projecting rewards onto a constraint set that respects the tree's hierarchy. Our empirical results show that SAE improves final accuracy over standard GRPO on mathematical reasoning tasks. This finding is supported by our theoretical analysis—which proves SAE reduces gradient variance for improved sample efficiency—and is demonstrated using both efficient heuristics and a formal quadratic program.
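A minimal sketch of the prefix-aware advantage computation the abstract describes, under stated assumptions: the rollout records, prefix tree, rewards, and the simple consistency rule below are hypothetical placeholders for illustration, not the paper's actual data or constraint set (which uses heuristics or a quadratic program).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical MCTS rollout records: (prefix_id, parent_prefix_id, final_reward).
# The tree, ids, and rewards are illustrative placeholders only.
rollouts = [
    ("root",   None,   0.0), ("root",   None,   1.0),
    ("root/a", "root", 1.0), ("root/a", "root", 1.0),
    ("root/b", "root", 0.0), ("root/b", "root", 1.0),
]

# 1) Estimate each prefix's expected return as the mean reward of its completions.
returns = defaultdict(list)
for prefix, _, reward in rollouts:
    returns[prefix].append(reward)
value = {p: mean(rs) for p, rs in returns.items()}

# 2) Apply a simple tree-consistency rule as a stand-in for the constrained
#    projection: here, a parent's value may not exceed the best value among its
#    children. A deeper tree would require a bottom-up pass.
children = defaultdict(set)
for prefix, parent, _ in rollouts:
    if parent is not None:
        children[parent].add(prefix)
for parent, kids in children.items():
    value[parent] = min(value[parent], max(value[k] for k in kids))

# 3) Prefix-aware advantage: each sample's reward minus the (consistent) value of
#    the prefix it was generated from, rather than a single group-wide baseline.
advantages = [reward - value[prefix] for prefix, _, reward in rollouts]
print(advantages)  # -> [-0.5, 0.5, 0.0, 0.0, -0.5, 0.5]
```

Samples drawn from stronger prefixes are measured against a higher baseline, so their advantages stay centered within each stage of the tree-structured curriculum rather than being inflated or deflated by the prefix they happened to start from.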
Submission Number: 35