Keywords: Reinforcement Learning; Unsupervised Learning
Abstract: Recent advances in large language models (LLMs) show that Reinforcement Learning with Verifiable Rewards (RLVR) can enhance the reasoning capabilities of LLMs using only automatically verifiable signals. However, collecting ground-truth answers for reasoning models remains a challenging and labor-intensive task, especially in resource-constrained scenarios. In this paper, we investigate the potential of unsupervised reinforcement learning with verifiable rewards and propose the uns-GRPO framework to improve the mathematical reasoning of small LLMs. First, we design an unsupervised reward model that generates pseudo answers via a first-repeat criterion: it treats the first repeated answer in a sequence of generated responses as the ground truth, which we show to be efficient and reliable in resource-constrained settings. Second, we propose an adaptive KL regularization to mitigate the noise introduced by pseudo answers. We observe a distinctive consistency between pseudo-answer confidence and accuracy rewards. Modulated by the accuracy rewards, the adaptive KL regularization enforces conservative optimization when confidence is low and encourages diverse exploration when confidence is high. Experimental results demonstrate that our unsupervised approach achieves stable improvements across diverse models, training datasets, and evaluation tasks.
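To make the first-repeat criterion concrete, here is a minimal sketch of one plausible reading of the abstract's description: given the final answers extracted from a group of sampled responses, the first answer to appear a second time is taken as the pseudo ground truth. The function name, the answer-extraction step, and the fallback when nothing repeats are assumptions, not details confirmed by the paper.

```python
def first_repeat_pseudo_answer(answers):
    """Pick a pseudo ground truth via a first-repeat criterion (sketch).

    `answers` is a list of final answers (e.g. strings) extracted from N
    responses sampled for the same prompt. The first answer that occurs
    a second time in the sequence is treated as the pseudo ground truth.
    Returns None when no answer repeats (assumed fallback: skip prompt).
    """
    seen = set()
    for ans in answers:
        if ans in seen:
            return ans  # first answer to appear for the second time
        seen.add(ans)
    return None


# Example: the third sample repeats "42", so "42" becomes the pseudo answer.
print(first_repeat_pseudo_answer(["42", "17", "42", "108"]))  # -> "42"
```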
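Similarly, the adaptive KL regularization could be read as scaling the KL coefficient inversely with pseudo-answer confidence. The sketch below assumes a simple linear schedule driven by the group's mean accuracy reward; the parameter names (`beta_min`, `beta_max`), the linear form, and the use of the mean reward as a confidence proxy are all illustrative assumptions.

```python
def adaptive_kl_coefficient(accuracy_reward, beta_max=0.1, beta_min=0.01):
    """Adapt the KL penalty to pseudo-answer confidence (sketch).

    `accuracy_reward` in [0, 1] is the mean accuracy reward against the
    pseudo answer for the current group, used as a confidence proxy.
    Low confidence -> large KL coefficient (conservative updates);
    high confidence -> small KL coefficient (freer exploration).
    """
    confidence = max(0.0, min(1.0, accuracy_reward))
    return beta_max - (beta_max - beta_min) * confidence


# Example: low-confidence groups get the strongest regularization.
print(adaptive_kl_coefficient(0.1))  # -> 0.091 (conservative)
print(adaptive_kl_coefficient(0.9))  # -> 0.019 (exploratory)
```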
Primary Area: reinforcement learning
Submission Number: 7700