Keywords: automated proof evaluation; LLM-as-a-judge; LLM-generated math proofs; rubric-guided grading; prompt optimization; expert-annotated proof dataset; evaluator reliability; reward modeling
TL;DR: Reliable evaluators for LLM-generated math proofs are missing. We introduce ProofBench, an expert-annotated dataset, and a 0–7 grading methodology; our ProofGrader (marking schemes + ensembling) achieves RMSE 1.093 against expert scores and lifts best-of-8 selection to 4.05/7, closing >90% of the gap to a human oracle.
Abstract: Recent advances in large language models (LLMs) for math reasoning have largely focused on tasks with easily verifiable final answers; however, generating natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap.
To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0–7 scale to model-generated math proofs.
We first introduce **ProofBench**, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions and 435 LLM-generated solutions from Gemini-2.5-Pro, o3, and DeepSeek-R1. With ProofBench, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions, and evaluation workflow.
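To make the dataset and design axes concrete, here is a minimal sketch of what a ProofBench record and an evaluator configuration could look like; the field and class names are illustrative assumptions, not the released schema.

```python
# Hypothetical sketch of a ProofBench record and an evaluator configuration.
# Field names are illustrative; the actual released schema may differ.
from dataclasses import dataclass

@dataclass
class ProofBenchRecord:
    problem_id: str            # e.g. competition + year + problem index
    competition: str           # one of the six major math competitions
    problem_statement: str
    reference_solution: str
    marking_scheme: str        # rubric describing partial-credit criteria
    generator: str             # "Gemini-2.5-Pro", "o3", or "DeepSeek-R1"
    model_proof: str
    expert_score: int          # fine-grained 0-7 rating from a human expert

@dataclass
class EvaluatorConfig:
    backbone: str                   # judge LLM used for grading
    use_reference_solution: bool = True
    use_marking_scheme: bool = True
    num_ensemble_samples: int = 1   # >1 enables score ensembling
```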
Our analysis delivers **ProofGrader**, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines.
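A minimal sketch of how rubric-guided grading with ensembling and MAE evaluation could be wired together, building on the hypothetical record above; the prompt format, the `judge_llm` callable, and the function names are assumptions rather than the paper's exact implementation.

```python
# Sketch: grade a proof with rich context, average multiple judge samples,
# and compare predicted scores to expert ratings via MAE.
import statistics

def grade_once(judge_llm, problem, proof, reference_solution, marking_scheme) -> float:
    """Ask the judge LLM for a single 0-7 score given rich grading context."""
    prompt = (
        f"Problem:\n{problem}\n\nReference solution:\n{reference_solution}\n\n"
        f"Marking scheme:\n{marking_scheme}\n\nCandidate proof:\n{proof}\n\n"
        "Assign an integer score from 0 to 7. Reply with the score only."
    )
    return float(judge_llm(prompt))  # assumes the judge returns a parseable number

def grade_ensemble(judge_llm, example, k: int = 8) -> float:
    """Average k independent judge samples to reduce scoring variance."""
    scores = [
        grade_once(judge_llm, example.problem_statement, example.model_proof,
                   example.reference_solution, example.marking_scheme)
        for _ in range(k)
    ]
    return statistics.mean(scores)

def mean_absolute_error(predicted, expert) -> float:
    """MAE between predicted scores and expert 0-7 ratings."""
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(expert)
```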
Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
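For intuition, a short sketch of the best-of-$n$ selection protocol under the same assumptions: the `grader` callable stands in for ProofGrader (or a naive binary evaluator), and the selected proof's quality is judged by its expert score.

```python
# Sketch of best-of-n selection: grade n candidate proofs for a problem and
# keep the one the evaluator scores highest; names here are illustrative.
def best_of_n(grader, problem_context, candidate_proofs):
    """Return the candidate proof the grader scores highest, with its score."""
    scored = [(grader(problem_context, proof), proof) for proof in candidate_proofs]
    best_score, best_proof = max(scored, key=lambda pair: pair[0])
    return best_proof, best_score

# Reported quality is the average expert score of the selected proofs
# (e.g. 4.14/7 at n = 16 for ProofGrader, vs 2.48 for a naive binary
# evaluator and 4.62 for the human oracle).
```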
Submission Number: 186