Self-Evolving Language Models through Co-evolved Discriminative Rubrics

Published: 05 Mar 2026, Last Modified: 13 Mar 2026 · ICLR 2026 Workshop RSI Spotlight · CC BY 4.0
Keywords: Self-improving models, LM co-evolution, RL with non-verifiable rewards
Abstract: We introduce a self-evolving training framework for language models that requires no human annotation, external reward model, or stronger teacher. A rubric generator produces instance-specific natural-language criteria for each question; a small frozen judge scores policy responses against them. Factoring evaluation this way shifts the burden of reward-signal quality from judge capability to rubric content, making an effective training signal possible from a frozen judge as small as 0.6B. The rubric generator is trained via GRPO with a binary discriminative reward: 1 if its criteria cause the frozen judge to correctly rank a preference pair, 0 otherwise. Preference pairs are constructed entirely from the policy’s own outputs; in co-evolving training, the rubric generator and policy are updated in alternation, with a replay buffer that pairs current responses against earlier ones. On twelve benchmarks spanning reasoning, code, knowledge, and instruction following, co-evolving training outperforms GPT-4.1-prompted rubrics (69.5 vs. 66.7 avg) and sequential training (68.4–68.7 avg), with a 0.6B judge achieving 70.0 avg. The learned rubric generator transfers to a Qwen3-14B policy without retraining (75.8 avg). Trained rubrics encode solutions as verifiable criteria, reducing evaluation to verification and explaining why self-evolution remains effective with judges far smaller than the policy.
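
The sketch below illustrates the binary discriminative reward and the pairing of current against replayed responses described in the abstract. It is a minimal illustration, not the authors' implementation: `generate_rubric`, `judge_score`, and the dummy policy are hypothetical stand-ins for the rubric generator, the frozen judge, and the policy, and the convention of labeling the current response as "chosen" is an assumption, since the abstract does not specify how pairs are labeled. The GRPO update of the rubric generator and the policy update are omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    question: str
    chosen: str    # response assumed better (here: the policy's current output)
    rejected: str  # response assumed worse (here: an earlier output from the replay buffer)


def generate_rubric(question: str) -> str:
    """Stand-in for the rubric generator: instance-specific criteria for one question."""
    return f"List the checks a correct answer to '{question}' must pass."


def judge_score(rubric: str, response: str) -> float:
    """Stand-in for the small frozen judge scoring a response against the rubric."""
    return random.random()  # placeholder; the paper uses a frozen LM judge as small as 0.6B


def discriminative_reward(rubric: str, pair: PreferencePair) -> float:
    """Binary reward for the rubric generator: 1.0 iff the frozen judge, given this
    rubric, ranks the chosen response above the rejected one; 0.0 otherwise."""
    return 1.0 if judge_score(rubric, pair.chosen) > judge_score(rubric, pair.rejected) else 0.0


def co_evolve_step(questions, policy_sample, replay_buffer):
    """One schematic alternation step: build preference pairs entirely from the
    policy's own outputs (current vs. replayed earlier response) and compute the
    binary discriminative rewards used to train the rubric generator."""
    pairs, rewards = [], []
    for q in questions:
        current = policy_sample(q)
        if q in replay_buffer:
            # Illustrative assumption: the current response is treated as "chosen".
            pair = PreferencePair(q, chosen=current, rejected=replay_buffer[q])
            pairs.append(pair)
            rewards.append(discriminative_reward(generate_rubric(q), pair))
        replay_buffer[q] = current
    return pairs, rewards


if __name__ == "__main__":
    buffer = {}
    dummy_policy = lambda q: f"answer to {q} #{random.randint(0, 99)}"
    for step in range(2):  # from the second step on, replayed responses yield pairs
        pairs, rewards = co_evolve_step(["q1", "q2"], dummy_policy, buffer)
        print(step, rewards)
```

Note that the reward depends only on whether the judge's ranking under the generated rubric is correct, which is consistent with the abstract's claim that rubric content, rather than judge capability, carries the training signal.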
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 112