Keywords: LLM, self-training, RL, unsupervised learning, self-penalization
Abstract: Reinforcement learning (RL) with human-annotated data has boosted long chain-of-thought (CoT) reasoning in large language models (LLMs), but these gains come at a high cost in labeled data while still faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce REinforcement learning with Self-resTRAINt training (RESTRAIN), a self-penalizing RL framework that transforms the absence of gold labels into a learning signal. Rather than amplifying spurious majority votes, RESTRAIN leverages signals from the model's entire answer distribution, penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. This restraint mechanism integrates seamlessly into policy optimization methods such as GRPO to enable self-improvement without human supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. On the Qwen3-4B-Base and OctoThinker Hybrid-8B-Base models, RESTRAIN boosts pass@1 by up to +140.7% on AIME25, +36.2% on MMLU STEM, and +19.6% on GPQA-Diamond. Remarkably, it comes within 0.4% of a fully supervised counterpart, nearly matching gold-label training while using no gold labels at all. These results demonstrate that RESTRAIN consistently boosts reasoning without supervision.
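The abstract does not give the exact RESTRAIN objective, but the core idea of turning the model's own answer distribution into a label-free training signal can be illustrated with a minimal sketch. The snippet below assumes a GRPO-style setup where each prompt gets a group of rollouts; the function name `self_penalized_advantages`, the thresholds, and the specific penalty forms are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch in the spirit of RESTRAIN, based only on the abstract:
# no gold labels; pseudo-rewards come from the model's own answer distribution,
# with penalties for overconfident rollouts and low-consistency prompts.
# All names, thresholds, and penalty forms here are illustrative assumptions.
from collections import Counter
from typing import List


def self_penalized_advantages(
    answers: List[str],               # final answers extracted from G rollouts of one prompt
    overconf_threshold: float = 0.9,  # vote share above which the signal is damped (assumed)
    min_consistency: float = 0.3,     # vote share below which the prompt is down-weighted (assumed)
) -> List[float]:
    """Turn answer-distribution statistics into GRPO-style advantages without labels."""
    G = len(answers)
    votes = Counter(answers)
    # Pseudo-reward: each rollout is scored by the vote share of its own answer,
    # i.e. how consistent it is with the rest of the group.
    shares = [votes[a] / G for a in answers]

    # Penalty 1: if the group is near-unanimous (overconfident), shrink the reward
    # so a possibly spurious majority vote is not blindly amplified.
    top_share = max(votes.values()) / G
    if top_share >= overconf_threshold:
        shares = [s * (1.0 - top_share) for s in shares]

    # Penalty 2: down-weight low-consistency prompts, whose majority answer is
    # itself unreliable, instead of training on them at full strength.
    prompt_weight = 1.0 if top_share >= min_consistency else top_share / min_consistency

    # GRPO-style group-normalized advantage over the penalized pseudo-rewards.
    mean = sum(shares) / G
    std = (sum((s - mean) ** 2 for s in shares) / G) ** 0.5 or 1.0
    return [prompt_weight * (s - mean) / std for s in shares]


if __name__ == "__main__":
    # Toy group of 6 rollouts: "42" is the majority answer but not unanimous.
    print(self_penalized_advantages(["42", "42", "42", "41", "42", "7"]))
```

In this sketch the resulting advantages would replace the reward-derived advantages in a standard GRPO update, so promising (consistent) reasoning chains are reinforced while overconfident or low-consistency groups contribute a weakened gradient.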
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19755