Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Dulhan Jayalath; Shashwat Goel; Thomas Foster; Parag Jain; Suchin Gururangan; Cheng Zhang; Anirudh Goyal; Alan Schelten

Published: 03 Mar 2026, Last Modified: 10 Mar 2026 | SPOT | CC BY 4.0

Keywords: Large Language Models, Rubrics, RL

TL;DR: CaT turns inference compute into supervision for reference-free RL. Self-proposed rubrics enable training in non-verifiable domains without annotations. On HealthBench, CaT improves over the initial policy by up to 30% (relative) while using 9× less test-time compute than test-time scaling.

Abstract: Where do learning signals come from when there is no ground truth in post-training? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels—critically, even in non-verifiable domains like healthcare guidance where no programmatic checker exists. We call this framework *Compute as Teacher (CaT)* and it turns inference-time compute from parallel rollouts into supervision for RL training. The framework has two components: (1) reference estimation which aggregates rollouts into a pseudo-reference answer, and (2) reward derivation which converts that pseudo-reference into RL rewards. For (1), we explore a simple method we call *synthesis*, but the framework admits any aggregator. For (2), we introduce self-proposed rubrics for non-verifiable domains. These are binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge. On HealthBench, models trained with CaT match or exceed inference-time aggregation quality while using 9× less test-time compute. Here, CaT also competes with learning from expert physician annotations, yielding up to +30% relative improvement over the initial policy. The framework extends naturally to verifiable rewards, matching the best existing baselines on MATH-500 in test-time RL and demonstrating 'drop-in' versatility across both types of domains.
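The two components described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `cat_rewards` and its callable parameters (`sample`, `synthesize`, `propose_rubrics`, `judge`) are hypothetical names introduced here. Scoring each rollout by the fraction of rubric criteria it satisfies, and rewarding the same rollouts that were aggregated, are both assumptions; the abstract states only that the rubrics are binary criteria generated from the pseudo-reference and scored by an LLM judge.

```python
from typing import Callable, List


def cat_rewards(
    prompt: str,
    sample: Callable[[str], str],                   # hypothetical: draws one rollout from the policy
    synthesize: Callable[[str, List[str]], str],    # hypothetical: aggregates rollouts into a pseudo-reference
    propose_rubrics: Callable[[str, str], List[str]],  # hypothetical: binary criteria from the pseudo-reference
    judge: Callable[[str, str, str], bool],         # hypothetical: does the response satisfy one criterion?
    num_rollouts: int = 4,
) -> List[float]:
    """Return one reward per rollout, using no human-written reference."""
    # Spend inference compute: generate parallel rollouts for the prompt.
    rollouts = [sample(prompt) for _ in range(num_rollouts)]

    # (1) Reference estimation: aggregate the rollouts into a pseudo-reference.
    reference = synthesize(prompt, rollouts)

    # (2) Reward derivation: turn the pseudo-reference into binary rubric criteria.
    rubrics = propose_rubrics(prompt, reference)
    if not rubrics:
        # No criteria proposed, so there is no signal to distinguish rollouts.
        return [0.0] * len(rollouts)

    # Assumption: the reward is the fraction of criteria the judge marks as satisfied.
    return [
        sum(judge(prompt, rollout, criterion) for criterion in rubrics) / len(rubrics)
        for rollout in rollouts
    ]
```

In an RL loop these rewards would be passed to the policy-gradient update of choice. For verifiable domains such as MATH-500, the rubric step would presumably be replaced by a programmatic comparison against the pseudo-reference, though the abstract does not give that detail.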

Submission Number: 17
