Keywords: GRPO, Reinforcement Learning, Calibration, Reasoning, Language Models, LLM, RLVR, PPO, RLOO, Perturb-seq
Abstract: Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable, deterministic domains such as mathematics and coding. However, it remains unclear whether current RL methods are similarly effective at optimizing language models to reason about the probability of uncertain events from stochastic data, a valuable capability for decision-making and scientific discovery. Here, we demonstrate that Group Relative Policy Optimization (GRPO) induces highly overconfident probability predictions across three proper-scoring-rule rewards, whereas Proximal Policy Optimization (PPO) and REINFORCE Leave-One-Out (RLOO) yield well-calibrated models. We show that removing group standard-deviation normalization from GRPO fixes its miscalibration, and we provide a theoretical explanation for why GRPO's biased advantage estimate causes overconfidence. Our results demonstrate the negative impact of GRPO's standard-deviation normalization on probabilistic prediction and highlight an important design consideration for RL algorithms: whereas unbiased advantage estimates provide a consistent optimization signal across tasks, biased advantage estimates must be aligned with the structure of the target objective to be effective.
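A minimal sketch of the group-relative advantage the abstract refers to, assuming the standard GRPO formulation; the function name, the `normalize_std` flag, and the `eps` constant are illustrative, not taken from the paper:

```python
import numpy as np

def group_advantages(rewards, normalize_std=True, eps=1e-8):
    """Group-relative advantage estimate in the style of GRPO.

    rewards: 1-D array of scalar rewards for G rollouts of the same prompt.
    normalize_std=True applies GRPO's per-group standard-deviation
    normalization, which the abstract identifies as the source of
    overconfidence; normalize_std=False keeps only the mean-centered
    baseline, matching the fixed variant the abstract describes.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = rewards - rewards.mean()          # center on the group mean
    if normalize_std:
        adv = adv / (rewards.std() + eps)   # biased per-group rescaling
    return adv
```

Under this sketch, `normalize_std=False` corresponds to the variant whose miscalibration is fixed; RLOO likewise centers each reward on a leave-one-out mean baseline without per-group rescaling.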
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22475