Keywords: llm evaluators, test-time scaling
Abstract: Language model (LM) evaluators that generate chain-of-thought (CoT) reasoning are widely used for the assessment of LM responses.
Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven to be an effective technique for solving challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by scaling test-time compute? To answer this, we investigate employing reasoning models - LMs that natively generate long CoT reasoning - as evaluators. We explore scaling evaluation-time compute by using reasoning models to evaluate both the overall candidate response (i.e., outcome evaluation) and the individual reasoning steps within it (i.e., process evaluation). We observe that evaluator performance improves monotonically with the number of reasoning tokens generated, mirroring trends seen in LM reasoning. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as increasing compute during generation for improving an LM's problem-solving performance.
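The reranking setup described above (scoring multiple candidate generations with an evaluator and keeping the best) can be sketched minimally as best-of-N selection. This is an illustrative sketch, not the paper's implementation: the `stub_score` function below is a hypothetical stand-in for a reasoning-model evaluator that would generate CoT before assigning a score.

```python
# Hypothetical best-of-N reranking sketch. In the paper's setting, the
# scoring function would be a reasoning model that produces long CoT
# reasoning and then a quality judgment for each candidate response.

def rerank(candidates, score_fn):
    """Return candidates sorted best-first by evaluator score."""
    return sorted(candidates, key=score_fn, reverse=True)

# Stand-in evaluator (illustration only): prefers answers that show
# working. A real evaluator would assess correctness, not length.
def stub_score(answer: str) -> float:
    return float(len(answer))

candidates = ["x=2", "x = 2 because 2 + 2 = 4", "unsure"]
best = rerank(candidates, stub_score)[0]
```

Spending more evaluation-time compute in this loop (e.g., longer CoT per candidate before scoring) is the knob the abstract argues can substitute for generating more candidates.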
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: automatic evaluation, inference methods
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7550