Keywords: LLM, self-training, RL, unsupervised learning, self-penalization
Abstract: Reinforcement learning (RL) with human-annotated data has boosted long chain-of-thought (CoT) reasoning in large language models (LLMs), but these gains come at a high cost in labeled data while still faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce REinforcement learning with Self-resTRAINt training (RESTRAIN), a self-penalizing RL framework that transforms the absence of gold labels into a learning signal. Rather than amplifying spurious majority votes, RESTRAIN leverages signals from the model's entire answer distribution, penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. This restraint mechanism integrates seamlessly into policy optimization methods such as GRPO to enable self-improvement without human supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. On the Qwen3-4B-Base and OctoThinker Hybrid-8B-Base models, RESTRAIN boosts pass@1 by up to +140.7% on AIME25, +36.2% on MMLU STEM, and +19.6% on GPQA-Diamond. Remarkably, it comes within 0.4% of a fully supervised counterpart, nearly matching gold-label training while using no gold labels at all. These results demonstrate that RESTRAIN consistently boosts reasoning without supervision.
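The abstract does not give the exact RESTRAIN objective, but the core idea of turning the model's own answer distribution into a label-free training signal can be illustrated with a minimal sketch. The snippet below assumes a GRPO-style setup where each prompt gets a group of rollouts; the function name `self_penalized_advantages`, the thresholds, and the specific penalty forms are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch in the spirit of RESTRAIN, based only on the abstract:
# no gold labels; pseudo-rewards come from the model's own answer distribution,
# with penalties for overconfident rollouts and low-consistency prompts.
# All names, thresholds, and penalty forms here are illustrative assumptions.
from collections import Counter
from typing import List


def self_penalized_advantages(
    answers: List[str],               # final answers extracted from G rollouts of one prompt
    overconf_threshold: float = 0.9,  # vote share above which the signal is damped (assumed)
    min_consistency: float = 0.3,     # vote share below which the prompt is down-weighted (assumed)
) -> List[float]:
    """Turn answer-distribution statistics into GRPO-style advantages without labels."""
    G = len(answers)
    votes = Counter(answers)
    # Pseudo-reward: each rollout is scored by the vote share of its own answer,
    # i.e. how consistent it is with the rest of the group.
    shares = [votes[a] / G for a in answers]

    # Penalty 1: if the group is near-unanimous (overconfident), shrink the reward
    # so a possibly spurious majority vote is not blindly amplified.
    top_share = max(votes.values()) / G
    if top_share >= overconf_threshold:
        shares = [s * (1.0 - top_share) for s in shares]

    # Penalty 2: down-weight low-consistency prompts, whose majority answer is
    # itself unreliable, instead of training on them at full strength.
    prompt_weight = 1.0 if top_share >= min_consistency else top_share / min_consistency

    # GRPO-style group-normalized advantage over the penalized pseudo-rewards.
    mean = sum(shares) / G
    std = (sum((s - mean) ** 2 for s in shares) / G) ** 0.5 or 1.0
    return [prompt_weight * (s - mean) / std for s in shares]


if __name__ == "__main__":
    # Toy group of 6 rollouts: "42" is the majority answer but not unanimous.
    print(self_penalized_advantages(["42", "42", "42", "41", "42", "7"]))
```

In this sketch the resulting advantages would replace the reward-derived advantages in a standard GRPO update, so promising (consistent) reasoning chains are reinforced while overconfident or low-consistency groups contribute a weakened gradient.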
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19755