Keywords: Reinforcement Learning, Large Language Models
Abstract: Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader to generate rubric-based reward signals substantially improves reasoning performance; remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster, more resource-efficient training while surpassing baselines. Notably, on Qwen3-32B, training on just the 4,000-sample *HealthBench Easy* subset is sufficient to obtain a model that exceeds GPT-5 on *HealthBench Hard*. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.
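As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below shows one way a model could grade its own response against rubric criteria and turn the result into a scalar RL reward; the `generate` callable, the grading prompt format, and the YES/NO protocol are all assumed placeholders.

```python
# Minimal sketch of rubric-based self-rewarding (assumptions, not the paper's code):
# the same model acts as policy and grader; the fraction of rubric criteria it
# judges satisfied becomes the reward used for reinforcement learning.
from typing import Callable, List

def rubric_self_reward(
    generate: Callable[[str], str],   # hypothetical: same model used as policy and grader
    prompt: str,
    response: str,
    rubric: List[str],                # e.g. HealthBench-style criterion strings
) -> float:
    satisfied = 0
    for criterion in rubric:
        grading_prompt = (
            "You are grading an answer against one rubric criterion.\n"
            f"Question: {prompt}\n"
            f"Answer: {response}\n"
            f"Criterion: {criterion}\n"
            "Reply with exactly YES or NO."
        )
        verdict = generate(grading_prompt).strip().upper()
        satisfied += verdict.startswith("YES")
    # Reward in [0, 1]: fraction of criteria the self-grader judged satisfied.
    return satisfied / max(len(rubric), 1)
```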
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 8382