Keywords: RLVR, LLM reasoning, entropy explosion, advantage estimation
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between \emph{entropy collapse} and \emph{entropy explosion}.
We trace both hazards to the mean baseline used in value-free RL (\eg GRPO/DAPO), which improperly penalizes negative-advantage samples under reward outliers.
We propose \emph{Quantile Advantage Estimation} (QAE), replacing the mean with a group-wise $K$-quantile baseline.
QAE induces a response-level, two-regime gate: on hard queries ($p \le 1{-}K$) it reinforces rare successes, while on easy queries ($p > 1{-}K$) it targets remaining failures.
Under first-order softmax updates, we prove \emph{two-sided entropy safety}, giving lower and upper bounds on the one-step entropy change that curb explosion and prevent collapse.
Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned $K$, roughly 80\% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME'24/'25 and AMC'23.
These results identify \emph{baseline design}, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
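As a concrete illustration of the baseline swap described in the abstract, the sketch below computes response-level advantages against a group-wise $K$-quantile baseline instead of the group mean. It assumes binary verifiable rewards grouped per query and uses NumPy; the function name `qae_advantages`, the value $K=0.8$, and the choice of the "lower" quantile method are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def qae_advantages(rewards: np.ndarray, k: float = 0.8) -> np.ndarray:
    """Illustrative sketch of a group-wise K-quantile baseline.

    `rewards` holds the verifiable (e.g. 0/1) rewards of the G rollouts for one query.
    With binary rewards, the K-quantile is 0 on hard queries (success rate p <= 1-K),
    so only the rare successes receive positive advantage; on easy queries (p > 1-K)
    it is 1, so only the remaining failures receive negative advantage.
    """
    # Replace the group-mean baseline (GRPO/DAPO style) with the empirical K-quantile.
    # method="lower" keeps the baseline at an observed reward value, so most
    # responses end up with exactly zero advantage (sparse credit assignment).
    baseline = np.quantile(rewards, k, method="lower")
    return rewards - baseline

# Hard query: a single success gets +1, everything else gets 0.
print(qae_advantages(np.array([0., 0., 0., 1., 0., 0., 0., 0.])))
# Easy query: the remaining failure gets -1, successes get 0.
print(qae_advantages(np.array([1., 1., 1., 0., 1., 1., 1., 1.])))
```

Using the "lower" quantile method is one way to keep the baseline equal to an observed reward with binary rewards, which makes the two-regime gating explicit; a linearly interpolated quantile at the regime boundary would instead assign small nonzero advantages to every response.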
Primary Area: reinforcement learning
Submission Number: 5107