Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large language model reasoning, self-supervised RL
TL;DR: We propose Co-rewarding, a novel self-supervised RL framework that improves training stability for large language model reasoning.
Abstract: Although reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs), a scaling dilemma remains due to its reliance on human-annotated labels, especially for complex tasks. Recent self-rewarding methods provide a label-free alternative that exhibits the potential to elicit LLM reasoning, but they often suffer from non-negligible training collapse, as the single-view supervision signal easily forms a self-consistent illusion that leads to reward hacking. Inspired by the success of self-supervised learning, we propose \textit{Co-rewarding}, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views. Specifically, we instantiate Co-rewarding in two ways: (1) \textit{Co-rewarding-I} is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) \textit{Co-rewarding-II} is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, these instantiations introduce different levels of discrepancy that make it harder for training to collapse onto trivial reasoning solutions. We also explore their orthogonal combination to further boost performance. Empirically, Co-rewarding exhibits stable training across various setups and outperforms other self-rewarding baselines by $+3.31\%$ on average across multiple mathematical reasoning benchmarks, and by $+7.49\%$ on Llama-3.2-3B-Instruct in particular. Notably, Co-rewarding matches or even surpasses ground-truth (GT) labeled rewards in several RLVR settings, such as achieving a Pass@$1$ of $94.01\%$ on GSM8K with Qwen3-8B-Base.
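The abstract describes the two instantiations only at a high level. Below is a minimal Python sketch of how such reward signals could be computed, assuming answer-level agreement rewards, majority-vote pseudo labels, and an EMA-style teacher update; all function names (`co_rewarding_i`, `co_rewarding_ii`, `ema_update`) and the exact reward definitions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two Co-rewarding signals described in the abstract.
# Reward definitions and helper names are assumptions for illustration only.
from collections import Counter
from typing import Dict, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among sampled rollouts."""
    return Counter(answers).most_common(1)[0][0]


def co_rewarding_i(answers_q: List[str], answers_q_analog: List[str]) -> List[float]:
    """Data-side signal (Co-rewarding-I, assumed form): reward each rollout for the
    original question q by its agreement with the consensus answer obtained on a
    semantically analogous question q'."""
    pseudo_label = majority_vote(answers_q_analog)
    return [1.0 if a == pseudo_label else 0.0 for a in answers_q]


def co_rewarding_ii(answers_student: List[str], answers_teacher: List[str]) -> List[float]:
    """Model-side signal (Co-rewarding-II, assumed form): reward each student rollout
    by its agreement with a pseudo label produced by a slowly-updated reference
    teacher (self-distillation)."""
    pseudo_label = majority_vote(answers_teacher)
    return [1.0 if a == pseudo_label else 0.0 for a in answers_student]


def ema_update(teacher: Dict[str, float], student: Dict[str, float], tau: float = 0.99) -> Dict[str, float]:
    """One way to keep the reference teacher 'slowly updated': an exponential
    moving average of student parameters (tau close to 1 means a slow teacher)."""
    return {k: tau * teacher[k] + (1.0 - tau) * student[k] for k in teacher}
```

In this reading, both instantiations replace the verifiable ground-truth reward of RLVR with an agreement reward against a second view (a rephrased question or a slow teacher), which is what introduces the discrepancy the abstract credits with preventing collapse onto trivial solutions.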
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11193