Keywords: Self-rewarding, Reinforcement learning, Noise learning
Abstract: Reinforcement learning with verifiable rewards (RLVR) efficiently scales the reasoning ability of large language models (LLMs) but remains bottlenecked by the limited supply of labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), in which the policy model assigns reward signals to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability still lag behind RLVR. We trace this gap to a systematic bias: the model tends to deem its own high-confidence rollouts correct, yielding biased and unstable reward estimation. This bias accumulates and grows rapidly as training proceeds, with the deviation from the oracle drifting toward over-reward; the result is unstable training and a capped performance ceiling. To understand how this bias produces these effects, we characterize it by the magnitude of reward bias, the degree of policy–reward coupling, and the proportional imbalance between over-reward and under-reward, via three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ affect convergence performance and speed, while $\rho_{\text{selfbias}}$ has an amplification effect: it amplifies both correct and incorrect updates and induces unstable reward estimation. To mitigate the systematic bias of RLIR, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models via adaptive reward interpolation and a rollout-selection strategy to build a unified reward-estimation space, jointly improving accuracy ($\rho_{\text{noise}}$), unbiasedness ($\rho_{\text{selfbias}}$, $\rho_{\text{symbias}}$), and robustness ($\rho_{\text{selfbias}}$). Extensive experiments show that RLER improves over the best RLIR baseline by 13.6\% and falls only 3.6\% short of the RLVR setting. Moreover, RLER scales stably on unlabeled samples, making it highly applicable.
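To make the ensembled-reward idea concrete, the following is a minimal illustrative sketch, not the paper's actual method: it assumes each ensemble model emits a correctness confidence in [0, 1] for a rollout, and interpolates between the policy's own self-assessment and the ensemble mean. The function name `ensemble_reward` and the interpolation weight `alpha` are hypothetical names introduced here for illustration.

```python
def ensemble_reward(votes, alpha=0.5):
    """Combine per-model correctness votes into one reward estimate.

    votes: list of floats in [0, 1]; votes[0] is the policy model's own
           self-assessment, the rest come from diverse ensemble models.
    alpha: interpolation weight between the self-vote and the ensemble
           mean; smaller alpha reduces policy-reward coupling.
    """
    if not votes:
        raise ValueError("need at least one vote")
    self_vote = votes[0]
    ensemble_mean = sum(votes) / len(votes)
    # Interpolating away from pure self-reward dilutes the tendency to
    # over-reward the policy's own high-confidence rollouts.
    return alpha * self_vote + (1 - alpha) * ensemble_mean
```

With a single voter the estimate reduces to plain self-reward; as more diverse voters are added, the ensemble mean increasingly tempers the self-vote.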
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7509