Self-Evolving Rubrics: Interpretable Instance-Level Criteria for Scalable RL

Published: 05 Mar 2026, Last Modified: 05 Mar 2026, ICLR 2026 Workshop RSI Spotlight, CC BY 4.0
Keywords: Self-improving rubrics, LM co-evolution, RL with non-verifiable rewards
Abstract: Rubric-based evaluation provides interpretable reward signals for language model training, but current methods obtain rubrics by prompting, producing criteria that sound reasonable rather than criteria that actually discriminate response quality. We propose training rubric generators via policy gradient with a binary discriminative reward: a rubric receives reward 1 if it causes a frozen judge to rank a preference pair correctly, and 0 otherwise. We study two training regimes: sequential (the rubric generator is trained alone, then frozen for policy training) and co-evolving (the rubric generator and policy are trained jointly with alternating updates). This is the first method to co-evolve rubric generation with policy training through alternating updates. Co-evolving training substantially outperforms prompted baselines across nine instruction-following benchmarks (45.7 avg vs. 42.9 for GPT-4.1), demonstrating that learned rubrics enable small local judges to match or exceed large prompted models. Co-evolving also provides clear gains over sequential training (45.7 vs. 42.5 avg), enabled by stronger preference signals that require an evolving policy. Unexpectedly, small judges (0.6–1.7B) produce substantially better policies than large judges (4–32B): the smallest judge outperforms the largest by +25 points on GSM8K, suggesting that rubric-based evaluation is most effective when judges depend on explicit criteria rather than relying on their internal capabilities.
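
To make the reward definition concrete, the following is a minimal sketch of the binary discriminative reward described in the abstract: a sampled rubric is rewarded only when it leads a frozen judge to rank the preferred response of a pair above the rejected one. The function and parameter names (`judge_score`, `generate_rubric`, `update_policy_gradient`) are hypothetical placeholders, not identifiers from the paper, and the policy-gradient update is left abstract.

```python
from typing import Callable, Iterable, Tuple

def discriminative_reward(
    rubric: str,
    prompt: str,
    chosen: str,
    rejected: str,
    judge_score: Callable[[str, str, str], float],
) -> float:
    """Binary reward for a sampled rubric.

    The frozen judge scores both responses of a preference pair under the
    rubric; the rubric earns reward 1 only if the judge ranks the preferred
    (chosen) response above the rejected one, and 0 otherwise.
    """
    score_chosen = judge_score(prompt, chosen, rubric)
    score_rejected = judge_score(prompt, rejected, rubric)
    return 1.0 if score_chosen > score_rejected else 0.0


def rubric_generator_step(
    batch: Iterable[Tuple[str, str, str]],
    generate_rubric: Callable[[str], str],
    judge_score: Callable[[str, str, str], float],
    update_policy_gradient: Callable[[list, list], None],
) -> None:
    """One policy-gradient step for the rubric generator (sketch).

    `generate_rubric` samples a rubric from the trainable generator and
    `update_policy_gradient` stands in for e.g. a REINFORCE-style update
    that reinforces rubrics which correctly discriminate the pair.
    """
    rubrics, rewards = [], []
    for prompt, chosen, rejected in batch:
        rubric = generate_rubric(prompt)  # sample an instance-level rubric
        r = discriminative_reward(rubric, prompt, chosen, rejected, judge_score)
        rubrics.append(rubric)
        rewards.append(r)
    update_policy_gradient(rubrics, rewards)
```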
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 112