How Far Can Unsupervised RLVR Scale LLM Training?

ICLR 2026 Conference Submission 23785 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026, ICLR 2026, License: CC BY 4.0
Keywords: Large Language Models, Unsupervised Reward, Reinforcement Learning, Reasoning
TL;DR: We revisit unsupervised RLVR through intrinsic rewards, unifying existing methods, analyzing their impact on confidence and failure modes, and discussing their potential applications.
Abstract: Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR) offers a pathway for Large Language Models (LLMs) to improve without human supervision. In particular, many works use model-intrinsic information as rewards for URLVR and report promising improvements, yet the potential and limitations of this approach remain unclear. In this work, we revisit URLVR through the lens of intrinsic rewards. We present a unified theoretical framework showing that intrinsic reward methods share a core mechanism: they trade uncertainty for performance by leveraging the model's prior knowledge to sharpen its output distributions. Empirical analysis confirms this tradeoff, revealing distinct failure modes and showing that collapse is not inevitable in small, domain-specific regimes such as test-time training. Beyond these findings, early intrinsic reward dynamics also provide a lightweight indicator of model-task priors, complementing pass@$k$ in assessing RL trainability. These insights highlight both the promise and pitfalls of URLVR, motivating future directions such as external rewards and hybrid supervision strategies.
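To illustrate the "sharpening" mechanism the abstract describes, below is a minimal sketch of one common intrinsic reward: the negative mean token entropy of the model's own output distribution. This is an assumed, generic formulation for illustration, not necessarily the exact reward studied in the paper; the function name `intrinsic_entropy_reward` is hypothetical.

```python
# Minimal sketch (assumed formulation): an entropy-based intrinsic reward.
# Maximizing it lowers per-token uncertainty, i.e. sharpens the policy's
# output distribution using only the model's own predictions (no labels).
import torch
import torch.nn.functional as F

def intrinsic_entropy_reward(logits: torch.Tensor) -> torch.Tensor:
    """
    logits: [seq_len, vocab_size] next-token logits for one sampled response.
    Returns a scalar reward equal to the negative mean per-token entropy.
    Higher reward <=> sharper (more confident) output distribution.
    """
    log_probs = F.log_softmax(logits, dim=-1)           # [T, V]
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)    # [T]
    return -token_entropy.mean()
```

In a typical unsupervised RLVR setup, this scalar would stand in for the verifier reward in an otherwise standard policy-gradient update, which is what makes the uncertainty-for-performance tradeoff possible without external supervision.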
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23785