Keywords: automated proof evaluation; LLM-as-a-judge; LLM-generated math proofs; rubric-guided grading; prompt optimization; expert-annotated proof dataset; evaluator reliability; reward modeling
TL;DR: Reliable evaluators for LLM-generated math proofs are missing. We introduce ProofBench, an expert-annotated dataset, and a 0–7 grading methodology; our ProofGrader (marking schemes + ensembling) achieves RMSE 1.093 against expert scores and lifts best-of-8 selection to 4.05/7, closing >90% of the gap to a human oracle.
Abstract: Recent advances in large language models (LLMs) for math reasoning have largely focused on tasks with easily verifiable final answers; however, generating natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap.
To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0–7 scale to model-generated math proofs.
We first introduce **ProofBench**, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions and 435 LLM-generated solutions from Gemini-2.5-Pro, o3, and DeepSeek-R1. With ProofBench, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions, and evaluation workflow.
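To make the dataset and design axes concrete, here is a minimal sketch of what a ProofBench record and an evaluator configuration could look like; the field and class names are illustrative assumptions, not the released schema.

```python
# Hypothetical sketch of a ProofBench record and an evaluator configuration.
# Field names are illustrative; the actual released schema may differ.
from dataclasses import dataclass

@dataclass
class ProofBenchRecord:
    problem_id: str            # e.g. competition + year + problem index
    competition: str           # one of the six major math competitions
    problem_statement: str
    reference_solution: str
    marking_scheme: str        # rubric describing partial-credit criteria
    generator: str             # "Gemini-2.5-Pro", "o3", or "DeepSeek-R1"
    model_proof: str
    expert_score: int          # fine-grained 0-7 rating from a human expert

@dataclass
class EvaluatorConfig:
    backbone: str                   # judge LLM used for grading
    use_reference_solution: bool = True
    use_marking_scheme: bool = True
    num_ensemble_samples: int = 1   # >1 enables score ensembling
```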
Our analysis delivers **ProofGrader**, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines.
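A minimal sketch of how rubric-guided grading with ensembling and MAE evaluation could be wired together, building on the hypothetical record above; the prompt format, the `judge_llm` callable, and the function names are assumptions rather than the paper's exact implementation.

```python
# Sketch: grade a proof with rich context, average multiple judge samples,
# and compare predicted scores to expert ratings via MAE.
import statistics

def grade_once(judge_llm, problem, proof, reference_solution, marking_scheme) -> float:
    """Ask the judge LLM for a single 0-7 score given rich grading context."""
    prompt = (
        f"Problem:\n{problem}\n\nReference solution:\n{reference_solution}\n\n"
        f"Marking scheme:\n{marking_scheme}\n\nCandidate proof:\n{proof}\n\n"
        "Assign an integer score from 0 to 7. Reply with the score only."
    )
    return float(judge_llm(prompt))  # assumes the judge returns a parseable number

def grade_ensemble(judge_llm, example, k: int = 8) -> float:
    """Average k independent judge samples to reduce scoring variance."""
    scores = [
        grade_once(judge_llm, example.problem_statement, example.model_proof,
                   example.reference_solution, example.marking_scheme)
        for _ in range(k)
    ]
    return statistics.mean(scores)

def mean_absolute_error(predicted, expert) -> float:
    """MAE between predicted scores and expert 0-7 ratings."""
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(expert)
```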
Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
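For intuition, a short sketch of the best-of-$n$ selection protocol under the same assumptions: the `grader` callable stands in for ProofGrader (or a naive binary evaluator), and the selected proof's quality is judged by its expert score.

```python
# Sketch of best-of-n selection: grade n candidate proofs for a problem and
# keep the one the evaluator scores highest; names here are illustrative.
def best_of_n(grader, problem_context, candidate_proofs):
    """Return the candidate proof the grader scores highest, with its score."""
    scored = [(grader(problem_context, proof), proof) for proof in candidate_proofs]
    best_score, best_proof = max(scored, key=lambda pair: pair[0])
    return best_proof, best_score

# Reported quality is the average expert score of the selected proofs
# (e.g. 4.14/7 at n = 16 for ProofGrader, vs 2.48 for a naive binary
# evaluator and 4.62 for the human oracle).
```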
Submission Number: 186