Keywords: GRPO, Reinforcement Learning, Calibration, Reasoning, Language Models, LLM, RLVR, PPO, RLOO, Perturb-seq
Abstract: Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable, deterministic domains such as mathematics and coding. However, it remains unclear whether current RL methods are similarly effective at optimizing language models to reason about the probability of uncertain events from stochastic data, a valuable capability for decision-making and scientific discovery. Here, we demonstrate that Group Relative Policy Optimization (GRPO) induces highly overconfident probability predictions across three proper-scoring-rule rewards, whereas Proximal Policy Optimization (PPO) and REINFORCE Leave-One-Out (RLOO) yield well-calibrated models. We show that removing group standard-deviation normalization from GRPO fixes its miscalibration, and we provide a theoretical explanation for why GRPO's biased advantage estimate causes overconfidence. Our results demonstrate the negative impact of GRPO's standard-deviation normalization on probabilistic prediction and highlight an important design consideration for RL algorithms: whereas unbiased advantage estimates provide a consistent optimization signal across tasks, biased advantage estimates must be aligned with the structure of the target objective to be effective.
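A minimal sketch of the group-relative advantage the abstract refers to, assuming the standard GRPO formulation; the function name, the `normalize_std` flag, and the `eps` constant are illustrative, not taken from the paper:

```python
import numpy as np

def group_advantages(rewards, normalize_std=True, eps=1e-8):
    """Group-relative advantage estimate in the style of GRPO.

    rewards: 1-D array of scalar rewards for G rollouts of the same prompt.
    normalize_std=True applies GRPO's per-group standard-deviation
    normalization, which the abstract identifies as the source of
    overconfidence; normalize_std=False keeps only the mean-centered
    baseline, matching the fixed variant the abstract describes.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = rewards - rewards.mean()          # center on the group mean
    if normalize_std:
        adv = adv / (rewards.std() + eps)   # biased per-group rescaling
    return adv
```

Under this sketch, `normalize_std=False` corresponds to the variant whose miscalibration is fixed; RLOO likewise centers each reward on a leave-one-out mean baseline without per-group rescaling.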
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22475