Beyond Scalar Critics: A Distributional Perspective on Reinforcement Learning with Verifiable Rewards for LLMs

Published: 03 Mar 2026, Last Modified: 03 Mar 2026 · License: CC BY 4.0
Keywords: RLVR, Value Distribution, LLM
TL;DR: We propose DistRLVR, a distributional RL framework for LLM post-training that models token-level return distributions and exploits tail advantages to achieve 24.1% relative improvement over PPO on math reasoning benchmarks.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become prevalent for LLM post-training, yet its reward signal is often terminal and near-binary, yielding prompt-conditional return distributions that are frequently long-tailed. The scalar critics typically adopted in RLVR obscure this distributional return structure and attenuate tail information, leading to less informative advantages and reduced optimization stability. Motivated by this, we model the return distribution during LLM RL fine-tuning and propose \textsc{DistRLVR}, a unified distributional RLVR framework whose critic supports both categorical and quantile return parameterizations. To stabilize distribution learning under long horizons and terminal-sparse rewards, we introduce dual Sample-Replacement targets that diversify supervision. Building on the learned return distributions, we develop tail-aware advantage shaping that selectively amplifies informative tails. Across a range of mathematical reasoning benchmarks, \textsc{DistRLVR} delivers consistent gains in sample efficiency, Pass@$k$, and average performance, achieving a 24.1\% overall improvement over PPO. These results suggest that exploiting distributional structure is a practical and promising direction for more reliable RLVR post-training.
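The abstract does not specify the exact form of the tail-aware advantage. As a minimal illustrative sketch only, the snippet below shows one plausible way a quantile-parameterized critic's output could be turned into an advantage that amplifies the upper tail: a standard mean-minus-baseline term plus a bonus proportional to how far the upper-tail quantiles exceed the mean. The function names, the `tail_frac` and `beta` knobs, and the specific formula are assumptions for illustration, not the paper's method.

```python
import numpy as np

def quantile_levels(n: int) -> np.ndarray:
    # Midpoint quantile levels tau_i = (i + 0.5) / n, as commonly
    # used in quantile-regression critics (e.g., QR-DQN style).
    return (np.arange(n) + 0.5) / n

def tail_aware_advantage(quantiles: np.ndarray, baseline: float,
                         tail_frac: float = 0.2, beta: float = 0.5) -> float:
    """Hypothetical tail-aware advantage from predicted return quantiles.

    quantiles : sorted per-token quantile estimates of the return, shape (n,)
    baseline  : scalar baseline (e.g., a group-mean return)
    tail_frac : fraction of quantiles treated as the "informative tail"
    beta      : weight on the tail bonus (both knobs are assumptions)
    """
    mean_ret = quantiles.mean()
    k = max(1, int(len(quantiles) * tail_frac))
    upper_tail = quantiles[-k:].mean()  # mean of the top-k quantiles
    # Base advantage plus a bonus when the upper tail sits above the mean,
    # so rollouts with a chance of high return get extra credit.
    return (mean_ret - baseline) + beta * (upper_tail - mean_ret)
```

For a long-tailed, near-binary return (most quantiles at 0, a few at 1), the tail bonus is large relative to the mean-based term, which is the intuition the abstract gives for why tail information yields more informative advantages than a scalar value estimate.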
Submission Number: 109