Keywords: Critic-free RL, Grouping-based Optimization, Static Value Estimate, Group Sampling, Agentic Reasoning, LLMs
Abstract: Grouping-based methods have emerged as a significant frontier in Reinforcement Learning (RL), yet agentic reasoning poses a fundamental challenge for them: frequent environmental interactions and multi-step tool invocations generate highly variable trajectories, rendering intra-group advantage estimation unstable.
In response, practitioners resort to excessive rollouts to stabilize training, which in turn incurs prohibitive computational costs.
This negative feedback loop between advantage estimation instability and sampling inefficiency severely limits learning performance.
We present PVPO, a stable and efficient critic-free RL framework that breaks this cycle through a pre-estimated value baseline and pre-sampled data filtering.
Specifically, before training begins, PVPO performs a single round of rollouts to compute two signals: (1) Static V, a Monte Carlo estimate of the expected return that serves as a fixed baseline to stabilize advantage estimation; and (2) sample-level accuracy, as a difficulty metric to filter out trivial samples and inject ground-truth trajectories into hard ones, thereby enhancing training efficiency.
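The pre-sampling step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`pre_sample`, `advantages`, `filter_and_augment`), the assumption of scalar per-trajectory rewards with a binary success criterion, and the accuracy thresholds are all hypothetical.

```python
import statistics

def pre_sample(policy, prompts, n_rollouts=8):
    """One round of rollouts per prompt before training (hypothetical API).

    Returns, per prompt:
      - static_v: Monte Carlo estimate of the expected return (fixed baseline)
      - accuracy: fraction of successful rollouts (difficulty metric)
    """
    stats = {}
    for p in prompts:
        rewards = [policy(p) for _ in range(n_rollouts)]
        stats[p] = {
            "static_v": statistics.mean(rewards),
            "accuracy": sum(r > 0 for r in rewards) / n_rollouts,
        }
    return stats

def advantages(rewards, static_v):
    # Advantage against the fixed pre-estimated baseline rather than the
    # per-group mean, decoupling estimation from small, noisy groups.
    return [r - static_v for r in rewards]

def filter_and_augment(stats, easy_thresh=1.0, hard_thresh=0.0):
    """Drop trivial samples; flag hard ones for ground-truth injection."""
    keep, needs_ground_truth = [], []
    for prompt, s in stats.items():
        if s["accuracy"] >= easy_thresh:
            continue  # trivial sample: every rollout succeeds, skip it
        if s["accuracy"] <= hard_thresh:
            needs_ground_truth.append(prompt)  # inject ground-truth trajectory
        keep.append(prompt)
    return keep, needs_ground_truth
```

Under this sketch, training then optimizes only on the filtered prompt set, with each group's advantages computed against its prompt's Static V.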
As shown in Figure 1, experiments demonstrate that PVPO outperforms other grouping-based methods in both multi-step retrieval tasks and advanced mathematical reasoning benchmarks.
Notably, our 7B model trained with PVPO matches or exceeds the performance of substantially larger language models (LLMs).
Moreover, PVPO achieves a 2.5x speedup in training time compared to prior methods while maintaining comparable final performance.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: reinforcement learning in agents, tool use, environment interaction, function calling, LLM agents
Contribution Types: Theory
Languages Studied: English
Submission Number: 9063