Keywords: LLM, self-training, RL, unsupervised learning, self-penalization
Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought
reasoning in large reasoning models, but these gains come at a high cost in labeled
data and still falter on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data.
We introduce REinforcement learning with Self-resTRAINt training (RESTRAIN),
a self-penalizing RL framework that converts the absence of gold labels into a
useful learning signal. Instead of overcommitting to spurious majority votes,
RESTRAIN exploits signals from the model’s entire answer distribution: penalizing
overconfident rollouts and low-consistency examples while preserving promising
reasoning chains. This self-penalization mechanism integrates seamlessly into
policy optimization methods such as GRPO, enabling continual self-improvement
without supervision. On challenging reasoning benchmarks, RESTRAIN delivers
large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker
Hybrid-8B-Base, it boosts Pass@1 by up to +140.7% on AIME25, +36.2% on
MMLU STEM, and +19.6% on GPQA-Diamond, nearly matching gold-label
training despite using no gold labels. These results establish RESTRAIN as a scalable path toward stronger reasoning without gold-label supervision.
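To make the self-penalization idea concrete, below is a minimal Python sketch of how a group of rollouts for one unlabeled prompt might be scored before a GRPO-style group-normalized update. The function name, the consistency threshold, and the specific weighting are illustrative assumptions for exposition, not the paper's actual formulation.

from collections import Counter
from typing import List

def self_penalized_rewards(
    answers: List[str],
    consistency_floor: float = 0.3,  # assumed threshold for flagging low-consistency prompts
) -> List[float]:
    """Assign pseudo-rewards to a group of rollouts sampled for one unlabeled prompt.

    Rather than trusting the majority vote outright, the majority reward is scaled
    by how consistent the answer distribution is, and prompts whose answers are too
    inconsistent are penalized with a flat reward so no spurious answer is reinforced.
    """
    counts = Counter(answers)
    n = len(answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    consistency = majority_count / n  # fraction of rollouts agreeing with the vote

    if consistency < consistency_floor:
        # Low-consistency example: identical rewards give a ~zero group-relative
        # advantage, so the update does not overcommit to an unreliable vote.
        return [0.0] * n

    rewards = []
    for ans in answers:
        if ans == majority_answer:
            # Majority rollouts are rewarded, but only in proportion to consistency,
            # tempering overconfident commitment to a possibly spurious vote.
            rewards.append(consistency)
        else:
            # Minority rollouts receive no reward under this sketch.
            rewards.append(0.0)
    return rewards

if __name__ == "__main__":
    # Example: 8 rollouts for one prompt; the majority answer "42" has 5/8 agreement.
    group = ["42", "42", "42", "17", "42", "9", "42", "17"]
    print(self_penalized_rewards(group))

Under this sketch, the resulting rewards would simply replace the gold-label reward in a standard GRPO advantage computation (subtract the group mean, divide by the group standard deviation), which is how such a signal could plug into policy optimization without supervision.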
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19755