Keywords: Critic-free RL, Grouping-based Optimization, Static Value Estimate, Group Sampling, Agentic Reasoning, LLMs
Abstract: Grouping-based methods have emerged as a significant frontier in Reinforcement Learning (RL), yet agentic reasoning poses a fundamental challenge for them: frequent environmental interactions and multi-step tool invocations generate highly variable trajectories, rendering intra-group advantage estimation unstable.
In response, practitioners resort to excessive rollouts to stabilize training, which in turn incurs prohibitive computational costs.
This negative feedback loop between advantage estimation instability and sampling inefficiency severely limits learning performance.
We present PVPO, a stable and efficient critic-free RL framework that breaks this cycle through a pre-estimated value baseline and pre-sampled data filtering.
Specifically, before training begins, PVPO performs a single round of rollouts to compute two signals: (1) Static V, a Monte Carlo estimate of the expected return that serves as a fixed baseline to stabilize advantage estimation; and (2) sample-level accuracy, as a difficulty metric to filter out trivial samples and inject ground-truth trajectories into hard ones, thereby enhancing training efficiency.
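The pre-sampling step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`pre_sample`, `advantages`, `filter_and_augment`), the assumption of scalar per-trajectory rewards with a binary success criterion, and the accuracy thresholds are all hypothetical.

```python
import statistics

def pre_sample(policy, prompts, n_rollouts=8):
    """One round of rollouts per prompt before training (hypothetical API).

    Returns, per prompt:
      - static_v: Monte Carlo estimate of the expected return (fixed baseline)
      - accuracy: fraction of successful rollouts (difficulty metric)
    """
    stats = {}
    for p in prompts:
        rewards = [policy(p) for _ in range(n_rollouts)]
        stats[p] = {
            "static_v": statistics.mean(rewards),
            "accuracy": sum(r > 0 for r in rewards) / n_rollouts,
        }
    return stats

def advantages(rewards, static_v):
    # Advantage against the fixed pre-estimated baseline rather than the
    # per-group mean, decoupling estimation from small, noisy groups.
    return [r - static_v for r in rewards]

def filter_and_augment(stats, easy_thresh=1.0, hard_thresh=0.0):
    """Drop trivial samples; flag hard ones for ground-truth injection."""
    keep, needs_ground_truth = [], []
    for prompt, s in stats.items():
        if s["accuracy"] >= easy_thresh:
            continue  # trivial sample: every rollout succeeds, skip it
        if s["accuracy"] <= hard_thresh:
            needs_ground_truth.append(prompt)  # inject ground-truth trajectory
        keep.append(prompt)
    return keep, needs_ground_truth
```

Under this sketch, training then optimizes only on the filtered prompt set, with each group's advantages computed against its prompt's Static V.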
As shown in Figure 1, experiments demonstrate that PVPO outperforms other grouping-based methods in both multi-step retrieval tasks and advanced mathematical reasoning benchmarks.
Notably, our 7B model trained with PVPO matches or exceeds the performance of substantially larger language models (LLMs).
Moreover, PVPO achieves a 2.5x speedup in training time compared to prior methods while maintaining comparable final performance.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: reinforcement learning in agents, tool use, environment interaction, function calling, LLM agents
Contribution Types: Theory
Languages Studied: English
Submission Number: 9063