Keywords: Post-Training, RL Fine-Tuning, Large Language Models
TL;DR: CaT turns inference compute into supervision by synthesizing rollouts into estimated references via the initial policy, enabling reference-free RL with programmatic verification or self-proposed rubrics (up to +33% on MATH-500, +30% on HealthBench).
Abstract: Where do learning signals come from when there is no ground truth in post-training? We propose turning exploration into supervision through Compute as Teacher (CaT), which converts the model's own exploration at inference time into reference-free supervision by synthesizing a single reference from a group of parallel rollouts and then optimizing toward it. Concretely, the current policy produces a group of rollouts; a frozen anchor (the initial policy) reconciles their omissions and contradictions to estimate a reference, turning extra inference-time compute into a teacher signal. We also offer a way to turn such an estimated reference, generated with any inference method, into rewards in two regimes: (i) verifiable tasks use programmatic equivalence on final answers; (ii) non-verifiable tasks use self-proposed rubrics—binary, auditable criteria scored by an independent LLM judge, with reward given by the fraction satisfied. Unlike selection methods (best-of-$N$, majority, perplexity, or judge scores), synthesis may disagree with the majority and be correct even when all rollouts are wrong; performance scales with the number of rollouts. As a test-time procedure, CaT provides large relative improvements on three instruction-tuned models: Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B (up to +27% on MATH-500; +12% on HealthBench). With reinforcement learning (CaT-RL), we obtain further gains (up to +33% and +30%) while using $9\times$ less test-time compute, with the trained policy surpassing the initial teacher.
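The abstract describes two reward regimes for the synthesized reference: programmatic equivalence on final answers (verifiable tasks) and the fraction of self-proposed binary rubric criteria satisfied (non-verifiable tasks). The sketch below illustrates those two reward computations only; the function names, the normalization-based equivalence check, and the precomputed judge verdicts are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of CaT-style reward computation (hypothetical helper names;
# implementation details beyond the abstract are assumed, not taken from the paper).


def verifiable_reward(rollout_answer: str, reference_answer: str) -> float:
    """Verifiable regime: programmatic equivalence of the rollout's final answer
    with the synthesized reference. Equivalence is illustrated here by a simple
    normalized string match; a real checker could use symbolic math equality."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(rollout_answer) == normalize(reference_answer) else 0.0


def rubric_reward(judge_verdicts: list[bool]) -> float:
    """Non-verifiable regime: reward is the fraction of self-proposed, binary
    rubric criteria that an independent LLM judge marks as satisfied."""
    if not judge_verdicts:
        return 0.0
    return sum(judge_verdicts) / len(judge_verdicts)


if __name__ == "__main__":
    # Verifiable task: compare a rollout's final answer to the estimated reference.
    print(verifiable_reward("  42 ", "42"))          # 1.0

    # Non-verifiable task: judge verdicts on four binary rubric criteria.
    print(rubric_reward([True, True, False, True]))  # 0.75
```

Either scalar can then serve directly as the reward in the RL fine-tuning stage (CaT-RL) described above.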
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7738