Rethinking LLM Judges: Chain-of-Thought and Multi-Step Pipelines for Math Grading

Published: 05 Mar 2026, Last Modified: 27 Apr 2026
Venue: ICLR 2026 Workshop LLM Reasoning
License: CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: LLM judge, LLM evaluations, chain-of-thought, math reasoning evaluation, prompting, grading, judge reliability, score stability
TL;DR: Comparative prompting, a single-pass strategy that compares student solutions to references, consistently outperforms multi-step pipelines and unconstrained CoT for LLM math judging.
Abstract: LLM judges are promising for evaluating reasoned mathematical solutions, yet their scores can be prompt-sensitive, unstable, and opaque. Two assumptions are currently prevalent: that chain-of-thought (CoT) reasoning provides little or no improvement in agreement with human grades for LLM judges, and that multi-step pipelines (e.g., debate or ensembles) outperform single-step evaluation. We challenge both. For CoT, on Putnam-AXIOM-Grading and IMO-GradingBench, two human-graded competition-mathematics benchmarks, we find that *unconstrained* CoT often reduces *evaluation consistency* and does not significantly affect performance. In contrast, *deliberately structured* CoT recovers and often improves agreement with human grades relative to *single-pass CoT-absent scoring*. This pattern is strongest for reasoning-optimized models such as *DeepSeek-R1*. For multi-step pipelines, popular methods (G-Eval, Chain-of-Verification, and Debate) consistently underperform simpler strategies. Our most striking finding is that *comparative prompting*, a single-pass strategy that explicitly compares student solutions to reference answers, is the most consistently high-performing strategy we tested: it ranks in the top three for correlation with human grades on all five models, and achieves the best correlation for three of them. Our findings point to a simple prescription for difficult mathematical grading: compare against a reference, reason within constraints, and avoid unnecessary complexity.
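As a rough illustration of the two ideas the abstract relies on, the sketch below builds a single-pass comparative prompt and measures agreement between judge scores and human grades. It is not the authors' prompt or evaluation code: the prompt wording, the `max_score` default, the function names, and the choice of Spearman rank correlation as the agreement measure are all assumptions made for the example.

```python
from statistics import mean


def build_comparative_prompt(problem, reference, student, max_score=10):
    """Assemble a single-pass prompt that asks the judge to compare
    the student solution against the reference (hypothetical wording)."""
    return (
        "You are grading a competition-mathematics solution.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference}\n\n"
        f"Student solution:\n{student}\n\n"
        "Compare the student solution to the reference step by step, "
        "noting where they agree and where the student deviates or errs. "
        f"Then output one integer score from 0 to {max_score}."
    )


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(judge_scores, human_grades):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(judge_scores), average_ranks(human_grades)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5
```

In use, each prompt would be sent to the judge model once, the integer score parsed from its reply, and `spearman` applied to the resulting scores and the human grades for the same solutions.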
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 106