Quantile Advantage Estimation for Entropy-Safe Reasoning

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: RLVR, LLM reasoning, entropy explosion, advantage estimation
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO/DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise $K$-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries ($p \le 1-K$) it reinforces rare successes, while on easy queries ($p > 1-K$) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower/upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned $K$, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME'24/'25 and AMC'23. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
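
For concreteness, the following is a minimal NumPy sketch of the group-wise $K$-quantile baseline described in the abstract. The lower-order-statistic quantile convention, the binary 0/1 rewards, and the illustrative reward vectors are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def qae_advantages(rewards, K=0.5):
    """Quantile Advantage Estimation: a minimal sketch, assuming binary
    verifiable rewards and the K-quantile taken as a lower order statistic.

    rewards: per-response rewards for one query group (e.g. 0/1 pass signals).
    K:       quantile level of the group-wise baseline (tuned hyperparameter).
    """
    rewards = np.asarray(rewards, dtype=float)
    # Replace the group-mean baseline of GRPO/DAPO with a K-quantile baseline.
    baseline = np.quantile(rewards, K, method="lower")
    return rewards - baseline

# With 0/1 rewards this induces the two-regime gate described above.
# Hard query (success rate p <= 1-K): baseline = 0, only rare successes get credit.
r_hard = [0, 0, 0, 0, 0, 0, 1, 0]
print(qae_advantages(r_hard, K=0.5))  # successes -> +1, failures -> 0

# Easy query (success rate p > 1-K): baseline = 1, only remaining failures are penalized.
r_easy = [1, 1, 1, 0, 1, 1, 1, 1]
print(qae_advantages(r_easy, K=0.5))  # failures -> -1, successes -> 0
```

In both regimes most responses in the group receive exactly zero advantage, which is the credit-assignment sparsification the abstract reports.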
Primary Area: reinforcement learning
Submission Number: 5107