Useless, or Untapped? Unlocking the Full Value of Zero-Advantage Samples for Better Policy Optimization
Keywords: Reinforcement Learning, LLM Reasoning, Zero-advantage Samples
Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a key technique for enhancing the reasoning capabilities of large language models (LLMs). Recent studies have identified that the widespread prevalence of zero-advantage samples significantly impairs the training efficiency of RLVR algorithms, since the associated vanishing gradients prevent effective parameter updates. To mitigate this issue, prior work attempts to discard such samples before or after rollout, but the computational cost of generating them remains unavoidable. In this paper, we propose a novel perspective on this challenge: if zero-advantage samples cannot be avoided, we should leverage them. We introduce ZAPO, a Zero-Advantage sample-augmented Policy Optimization method that activates zero-advantage samples and enables them to make unique contributions to policy updates. Specifically, we use entropy to provide additional reward signals for zero-advantage samples, restoring their advantages and thereby improving training efficiency. At the same time, entropy-based rewards drive exploration of previously unconsidered reasoning paths and expand the model's capability boundary. Experimental results on five math reasoning benchmarks and three base models (Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and Qwen2.5-Math-7B) show that ZAPO achieves superior average reasoning performance (45.7%, 54.2%, and 55.4%) while delivering training speedups of 1.7×, 1.3×, and 1.2× on the three base models, respectively, validating the effectiveness of the proposed approach.
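To illustrate the core idea, the following minimal sketch shows how an entropy-based bonus can restore non-zero advantages in a group whose verifiable rewards are identical (an all-correct or all-wrong rollout group). This is only an assumption-laden illustration, not the paper's exact ZAPO formulation: the bonus scale `beta`, the use of mean token entropy as the entropy signal, and the group-normalized (GRPO-style) advantage are all assumed here.

```python
import numpy as np

def group_advantages(rewards, entropies, beta=0.1, eps=1e-6):
    """Group-normalized advantages with a hypothetical entropy bonus
    applied only to zero-advantage groups (sketch, not the paper's code).

    rewards:   verifiable rewards for the G rollouts of one prompt (e.g. 0/1)
    entropies: per-rollout mean token entropy (assumed proxy for exploration)
    beta:      scale of the entropy bonus (assumed hyperparameter)
    """
    rewards = np.asarray(rewards, dtype=float)
    entropies = np.asarray(entropies, dtype=float)

    # With identical rewards, standard group-normalized advantages are all
    # zero and the policy gradient vanishes for this group.
    if np.ptp(rewards) < eps:
        # Re-inject a learning signal: shape rewards with an entropy bonus
        # so rollouts in the group are no longer indistinguishable.
        shaped = rewards + beta * entropies
    else:
        shaped = rewards

    return (shaped - shaped.mean()) / (shaped.std() + eps)

# Example: an all-wrong group (rewards all 0) would normally yield zero
# advantages; the entropy bonus restores a non-degenerate update direction.
print(group_advantages([0, 0, 0, 0], entropies=[1.2, 0.8, 1.5, 0.9]))
```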
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11351