Self-Evolving Language Models through Co-evolved Discriminative Rubrics

Published: 05 Mar 2026, Last Modified: 13 Mar 2026 · ICLR 2026 Workshop RSI Spotlight · CC BY 4.0
Keywords: Self-improving models, LM co-evolution, RL with non-verifiable rewards
Abstract: We introduce a self-evolving training framework for language models that requires no human annotation, external reward model, or stronger teacher. A rubric generator produces instance-specific natural-language criteria for each question; a small frozen judge scores policy responses against them. Factoring evaluation this way shifts the burden of reward-signal quality from judge capability to rubric content, making an effective training signal possible from a frozen judge as small as 0.6B. The rubric generator is trained via GRPO with a binary discriminative reward: 1 if its criteria cause the frozen judge to correctly rank a preference pair, 0 otherwise. Preference pairs are constructed entirely from the policy’s own outputs; in co-evolving training, the rubric generator and policy are updated in alternation, with a replay buffer that pairs current responses against earlier ones. On twelve benchmarks spanning reasoning, code, knowledge, and instruction following, co-evolving training outperforms GPT-4.1-prompted rubrics (69.5 vs. 66.7 avg) and sequential training (68.4–68.7 avg), with a 0.6B judge achieving 70.0 avg. The learned rubric generator transfers to a Qwen3-14B policy without retraining (75.8 avg). Trained rubrics encode solutions as verifiable criteria, reducing evaluation to verification and explaining why self-evolution remains effective with judges far smaller than the policy.
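
The sketch below illustrates the binary discriminative reward and the pairing of current against replayed responses described in the abstract. It is a minimal illustration, not the authors' implementation: `generate_rubric`, `judge_score`, and the dummy policy are hypothetical stand-ins for the rubric generator, the frozen judge, and the policy, and the convention of labeling the current response as "chosen" is an assumption, since the abstract does not specify how pairs are labeled. The GRPO update of the rubric generator and the policy update are omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    question: str
    chosen: str    # response assumed better (here: the policy's current output)
    rejected: str  # response assumed worse (here: an earlier output from the replay buffer)


def generate_rubric(question: str) -> str:
    """Stand-in for the rubric generator: instance-specific criteria for one question."""
    return f"List the checks a correct answer to '{question}' must pass."


def judge_score(rubric: str, response: str) -> float:
    """Stand-in for the small frozen judge scoring a response against the rubric."""
    return random.random()  # placeholder; the paper uses a frozen LM judge as small as 0.6B


def discriminative_reward(rubric: str, pair: PreferencePair) -> float:
    """Binary reward for the rubric generator: 1.0 iff the frozen judge, given this
    rubric, ranks the chosen response above the rejected one; 0.0 otherwise."""
    return 1.0 if judge_score(rubric, pair.chosen) > judge_score(rubric, pair.rejected) else 0.0


def co_evolve_step(questions, policy_sample, replay_buffer):
    """One schematic alternation step: build preference pairs entirely from the
    policy's own outputs (current vs. replayed earlier response) and compute the
    binary discriminative rewards used to train the rubric generator."""
    pairs, rewards = [], []
    for q in questions:
        current = policy_sample(q)
        if q in replay_buffer:
            # Illustrative assumption: the current response is treated as "chosen".
            pair = PreferencePair(q, chosen=current, rejected=replay_buffer[q])
            pairs.append(pair)
            rewards.append(discriminative_reward(generate_rubric(q), pair))
        replay_buffer[q] = current
    return pairs, rewards


if __name__ == "__main__":
    buffer = {}
    dummy_policy = lambda q: f"answer to {q} #{random.randint(0, 99)}"
    for step in range(2):  # from the second step on, replayed responses yield pairs
        pairs, rewards = co_evolve_step(["q1", "q2"], dummy_policy, buffer)
        print(step, rewards)
```

Note that the reward depends only on whether the judge's ranking under the generated rubric is correct, which is consistent with the abstract's claim that rubric content, rather than judge capability, carries the training signal.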
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 112