Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
TL;DR: We propose a novel benchmark, JETTS, to evaluate LLM-as-judges on their helpfulness as evaluators for a generator's test-time scaling.
Abstract: Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically relies on external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) paired with 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, natural language critiques, though unique to LLM-judges, are currently ineffective in guiding the generator toward better responses.
Lay Summary: Large language models (LLMs) are commonly used in various NLP tasks. One particular use case is to judge, or evaluate, the responses generated by other LLMs. This capability is often used to establish the relative quality of different LLMs by asking them to respond to a common set of queries. Orthogonally, an emerging use of judge models is to improve model responses. For example, for a single query (e.g., a math problem), the stochastic nature of an LLM allows it to generate multiple responses, some correct and others incorrect. If LLM-judges can assess the quality of each response, then they should be able to select a correct one.
In this paper, we systematically evaluate the helpfulness of LLM-judges in these settings, also known as test-time scaling. We propose the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge helpfulness in three different test-time scaling strategies across eight datasets in three domains (math reasoning, code generation, and instruction following). We use JETTS to comprehensively assess the strengths and weaknesses of judge models, which both offers recommendations to machine learning practitioners and informs directions for future research.
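To make the reranking setting from the summary above concrete, below is a minimal, illustrative sketch of judge-based best-of-N selection; it is not the JETTS implementation, and the `generate` and `judge_score` callables are hypothetical stand-ins for a generator LLM and an LLM-judge scoring interface.

```python
# Illustrative sketch only: judge-based best-of-N reranking.
# `generate` and `judge_score` are hypothetical placeholders, not JETTS APIs.

from typing import Callable, List, Tuple


def best_of_n(
    query: str,
    generate: Callable[[str], str],            # samples one response from a generator LLM
    judge_score: Callable[[str, str], float],  # judge assigns a scalar quality score
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidate responses and return the one the judge scores highest."""
    candidates: List[str] = [generate(query) for _ in range(n)]
    scored = [(resp, judge_score(query, resp)) for resp in candidates]
    # Pick the judge-preferred response; max() breaks ties by earlier sample order.
    best_resp, best_score = max(scored, key=lambda pair: pair[1])
    return best_resp, best_score


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end-to-end without real models.
    import random

    random.seed(0)
    toy_generate = lambda q: f"answer-{random.randint(0, 9)}"
    toy_judge = lambda q, r: float(r.split("-")[1])  # pretend a higher digit means a better answer
    print(best_of_n("What is 2 + 2?", toy_generate, toy_judge, n=4))
```

In this best-of-N view, the judge's only job is to rank a fixed pool of candidates; the beam-search and refinement settings studied in the paper additionally let judge feedback influence how new candidates are produced.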
Link To Code: https://github.com/salesforceairesearch/jetts-benchmark
Primary Area: Deep Learning->Large Language Models
Keywords: LLM-as-judge, test-time scaling, benchmark
Submission Number: 14314