Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

ICLR 2026 Conference Submission 21150 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: llm evaluators, evaluation-time scaling, test-time scaling, process evaluation
TL;DR: We demonstrate that evaluator performance improves monotonically as evaluation-time compute is scaled with process evaluation, and that this gain can in turn be translated into improved generator performance.
Abstract: Language model (LM) evaluators that generate chain-of-thought (CoT) reasoning are widely used for the assessment of LM responses. Simultaneously, increasing LMs’ “thinking” time through scaling test-time compute has proven to be an effective technique for solving challenging problems in domains such as math and code. This raises a natural question: can an LM’s evaluation capability also be improved by scaling test-time compute? To answer this, we investigate employing reasoning models – LMs that natively generate long CoT reasoning – as evaluators. We explore scaling evaluation-time compute by using reasoning models to evaluate both the overall candidate response (i.e., outcome evaluation) and the individual reasoning steps within it (i.e., process evaluation). In our experiments, we observe that evaluator performance improves monotonically with the number of reasoning tokens generated, mirroring trends seen in LM reasoning. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as increasing compute during generation for improving an LM’s problem-solving performance.
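As a rough illustration of the reranking setup described in the abstract, the sketch below shows best-of-N reranking in which a reasoning-model evaluator scores each step of a candidate solution and the candidate with the best aggregated step score is selected. The `evaluate_step` callable and the scoring scale are hypothetical placeholders, not the paper's actual prompts, models, or aggregation rule.

```python
# Minimal sketch of evaluation-time reranking with a process evaluator.
# `evaluate_step` is a hypothetical stand-in for any LM call that reads the
# problem, the preceding steps, and the current step, and returns a score.

from typing import Callable, List, Optional


def rerank_with_process_evaluation(
    problem: str,
    candidates: List[List[str]],  # each candidate is a list of reasoning steps
    evaluate_step: Callable[[str, List[str], str], float],  # score in [0, 1]
) -> Optional[List[str]]:
    """Return the candidate whose steps receive the best aggregated score."""
    best_candidate, best_score = None, float("-inf")
    for steps in candidates:
        # Score every step given the problem and the steps that precede it,
        # then aggregate with the minimum so one bad step sinks a candidate.
        step_scores = [
            evaluate_step(problem, steps[:i], step)
            for i, step in enumerate(steps)
        ]
        score = min(step_scores) if step_scores else float("-inf")
        if score > best_score:
            best_candidate, best_score = steps, score
    return best_candidate
```

Aggregating step scores with the minimum rather than the mean is one common choice for process-level evaluation; the paper's actual aggregation, prompting, and evaluator configurations are described in the full text.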
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21150