Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench and PPE, despite being trained on fewer, and entirely synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
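
To make the plan-then-execute structure concrete, below is a minimal Python sketch of the two-stage judging flow the abstract describes: the judge first drafts an unconstrained evaluation plan for the instruction, then executes that plan over the candidate responses to reach a verdict. The prompt templates, the `generate` callable, and the verdict-extraction rule are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, Tuple

# Hypothetical prompt scaffolding; the paper's actual templates are not shown here.
PLAN_PROMPT = (
    "You are evaluating two responses to the instruction below.\n"
    "First, write an evaluation plan: the steps needed to judge responses "
    "to this specific instruction.\n\nInstruction:\n{instruction}\n\nEvaluation plan:"
)
EXECUTE_PROMPT = (
    "Instruction:\n{instruction}\n\nEvaluation plan:\n{plan}\n\n"
    "Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
    "Execute the plan step by step, then output the final verdict as 'A' or 'B'."
)


def judge(
    generate: Callable[[str], str],
    instruction: str,
    response_a: str,
    response_b: str,
) -> Tuple[str, str, str]:
    """Plan -> execute -> verdict, returning the full reasoning trace."""
    # Stage 1: an unconstrained evaluation plan conditioned only on the instruction.
    plan = generate(PLAN_PROMPT.format(instruction=instruction))
    # Stage 2: execute the plan over the two candidate responses.
    execution = generate(
        EXECUTE_PROMPT.format(
            instruction=instruction,
            plan=plan,
            response_a=response_a,
            response_b=response_b,
        )
    )
    # Assumed convention: the verdict is the last 'A' or 'B' character in the execution.
    verdict = next((c for c in reversed(execution) if c in "AB"), "A")
    return plan, execution, verdict
```

In practice, `generate` would wrap a Thinking-LLM-as-a-Judge model; keeping the plan and its execution as separate generations mirrors the separation of planning from evaluation reasoning that the method argues for.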
Lay Summary: The advancement of AI is hindered by the limitations of evaluation methods, but Large Language Models (LLMs) have emerged as a key solution by serving as effective evaluators. These models generate Chain-of-Thoughts (CoTs) that capture the step-by-step reasoning process underlying the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, training LLM-as-a-Judge models that can think and reason has proved challenging. To that end, we develop a new preference optimization algorithm called EvalPlanner that helps LLMs think before producing judgments. Its thoughts consist of an evaluation plan, comprising the steps needed to evaluate responses to the given instruction, followed by an execution of that plan to arrive at the final judgment. EvalPlanner relies entirely on synthetic data, iteratively optimizing the model's plans and executions in a self-training loop. Our method achieves state-of-the-art performance on several benchmarks, outperforming other approaches despite being trained on less data, all of it synthetic. By improving the evaluation process, we can create more reliable and trustworthy AI systems with significant implications for applications where accurate evaluation is critical, such as education, hiring, and content moderation.
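
The self-training loop can be sketched roughly as follows: sample several plan-plus-execution traces per training example, treat a trace whose verdict matches the known preferred response as "chosen" and a mismatching one as "rejected", and hand the resulting preference pairs to a preference optimizer. The sampling count, pairing rule, and `judge_fn` interface here are assumptions for illustration, not details taken from the paper.

```python
import random
from typing import Callable, Dict, List, Tuple


def build_preference_pairs(
    judge_fn: Callable[[str, str, str], Tuple[str, str, str]],
    dataset: List[Dict[str, str]],
    num_samples: int = 8,
) -> List[Tuple[str, str]]:
    """Construct (chosen, rejected) CoT pairs for preference optimization.

    Each dataset item is assumed to hold 'instruction', 'response_a',
    'response_b', and a ground-truth 'label' ('A' or 'B'). judge_fn returns
    a (plan, execution, verdict) trace, e.g. the judge() sketch above with
    its generate callable bound.
    """
    pairs: List[Tuple[str, str]] = []
    for ex in dataset:
        traces = [
            judge_fn(ex["instruction"], ex["response_a"], ex["response_b"])
            for _ in range(num_samples)
        ]
        # Traces whose final verdict agrees with the known preference are
        # candidate "chosen" CoTs; disagreeing traces are "rejected".
        correct = [(p, e) for p, e, v in traces if v == ex["label"]]
        wrong = [(p, e) for p, e, v in traces if v != ex["label"]]
        if correct and wrong:
            chosen = "\n\n".join(random.choice(correct))
            rejected = "\n\n".join(random.choice(wrong))
            pairs.append((chosen, rejected))
    return pairs
```

A DPO-style update on these pairs, followed by re-sampling plans and executions with the updated judge, would correspond to one iteration of the self-training loop described above.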
Primary Area: Deep Learning->Large Language Models
Keywords: LLM-as-a-Judge, Planning, Reasoning, Evaluation
Submission Number: 12919