Keywords: Mathematical proof assessment, Automated grading, Olympiad mathematics, LLM evaluation, Proof verification
TL;DR: We evaluate how well state-of-the-art LLMs can grade mathematical competition proofs, finding they reliably detect errors but overestimate scores, which we improve through reference-aware grading workflows.
Abstract: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro–generated solutions that we grade on a 1–4 scale with precise error types and locations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0–7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce Agentic Workflows that extract and analyze reference solutions and automatically derive task-specific rubrics for a multi-step grading process. We instantiate and compare two rubric design choices, approachability-based weighting (by "aha" difficulty) and milestone-based rubrics, and evaluate their trade-offs. Across our annotated corpus and MathArena, these workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research: \href{https://github.com/ref-grader/ref-grader}{https://github.com/ref-grader/ref-grader}.
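To make the reference-aware, multi-step grading idea concrete, here is a minimal sketch of a milestone-based rubric workflow. It is an illustration only, not the authors' released implementation: the `ask` callable (an LLM client), the `RubricItem` dataclass, the prompt wording, and the `derive_rubric`/`grade` helpers are all assumed names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch: derive a weighted rubric from a reference solution,
# then grade a candidate proof against each rubric item in separate steps.
# `ask` is any function that sends a prompt to an LLM and returns its reply.

@dataclass
class RubricItem:
    description: str  # milestone a correct proof must reach
    weight: float     # fraction of the total score assigned to this milestone

def derive_rubric(ask: Callable[[str], str], problem: str, reference: str) -> List[RubricItem]:
    """Ask the model to split a reference solution into weighted milestones."""
    prompt = (
        "Read the problem and reference solution. List the key milestones a correct "
        "proof must reach, one per line as 'weight | description', weights summing to 1.\n\n"
        f"Problem:\n{problem}\n\nReference solution:\n{reference}\n"
    )
    items: List[RubricItem] = []
    for line in ask(prompt).splitlines():
        if "|" in line:
            weight, desc = line.split("|", 1)
            items.append(RubricItem(desc.strip(), float(weight)))
    return items

def grade(ask: Callable[[str], str], problem: str, candidate: str,
          rubric: List[RubricItem], max_score: float = 7.0) -> float:
    """Check each milestone independently, then combine into a 0-7 score."""
    total = 0.0
    for item in rubric:
        prompt = (
            "Does the candidate proof correctly establish this milestone? Answer YES or NO.\n\n"
            f"Milestone: {item.description}\n\nProblem:\n{problem}\n\n"
            f"Candidate proof:\n{candidate}\n"
        )
        if ask(prompt).strip().upper().startswith("YES"):
            total += item.weight
    return round(total * max_score, 1)
```

An approachability-based variant would differ only in how the weights are assigned (e.g., prompting the model to weight each milestone by how hard its key idea is to find), while the per-milestone checking loop stays the same.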
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23347