Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while improving Pass@1 through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches, whether data-filtering methods that select prompts by difficulty or advantage normalization schemes, treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, suppressing valid exploratory trajectories.
We propose the Asymmetric Confidence-aware Error Penalty (ACE), which introduces a per-rollout confidence shift metric to dynamically modulate negative advantages. Theoretically, we show that ACE's gradient decomposes into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. In experiments, we fine-tune Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO implemented in VERL, and evaluate on MATH-500 and AIME 2025. ACE yields the strongest and most consistent gains on the two Qwen families; on Llama-3.1-8B-Instruct, ACE-GRPO delivers modest but consistent gains at large k over GRPO, indicating partial robustness beyond the primary model family.
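The abstract does not give ACE's exact formula, so the following is a minimal sketch of the general idea under stated assumptions: the function name `ace_advantages`, the use of mean token log-probability as the confidence proxy, the group-relative definition of the confidence shift, and the multiplicative scaling parameter `beta` are all illustrative choices, not the paper's method.

```python
import numpy as np

def ace_advantages(rewards, mean_logprobs, beta=1.0):
    """Hypothetical sketch of an ACE-style asymmetric error penalty.

    rewards:       binary correctness of each rollout in a group (0/1)
    mean_logprobs: mean token log-probability of each rollout, used here
                   as a proxy for the policy's confidence in that rollout
    beta:          strength of the confidence-dependent modulation
                   (assumed hyperparameter, not from the paper)
    """
    rewards = np.asarray(rewards, dtype=float)
    conf = np.asarray(mean_logprobs, dtype=float)

    # Standard GRPO group-normalized advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Per-rollout confidence shift: how much more confident the policy is
    # in this rollout than in the group average (assumed definition).
    shift = conf - conf.mean()

    # Asymmetric modulation: amplify the negative advantage of incorrect
    # rollouts the policy is overconfident about, while leaving correct
    # rollouts and underconfident errors untouched.
    wrong = rewards == 0
    overconfident = wrong & (shift > 0)
    adv[overconfident] *= 1.0 + beta * shift[overconfident]
    return adv
```

Whatever the paper's precise metric, this illustrates the claimed mechanism: uniform group normalization assigns every incorrect rollout the same negative advantage, whereas a confidence-aware penalty pushes harder against exactly the spuriously reinforced (overconfident) errors the abstract identifies as the root cause.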
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Nino_Vieillard1
Submission Number: 8238