RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows

Published: 17 Oct 2025, Last Modified: 21 Nov 2025 · MATH-AI 2025 Poster · CC BY 4.0
Keywords: Competition Proof Grading, LLM-as-a-judge, Proof Grading
TL;DR: We design agentic workflows that accurately grade solutions to Olympiad-level math problems
Abstract: State-of-the-art LLMs have advanced from failing proof-based Olympiad problems to solving 5 of 6 IMO 2025 problems. We assess, as a case study, whether Gemini 2.5 Pro can grade proofs by detecting errors, assigning severity, and awarding partial credit beyond binary correctness. We evaluate Gemini 2.5 Pro’s performance on two datasets: (1) 90 solutions generated by Gemini 2.5 Pro, carefully annotated by expert evaluators with scores of 1–4 and precise error annotations, and (2) MathArena IMO/USAMO 2025 solutions scored 0–7. We first show that single-step grading is unreliable: while the model reliably flags incorrect solutions, it struggles to calibrate partial credit. To address this, we introduce RefGrader, a grading agent built on Gemini 2.5 Pro whose workflows automatically derive problem-specific rubrics from reference solutions for multi-step grading. We provide comprehensive analysis and ablation studies across the proposed workflows, demonstrating superior agreement with human grades and more reliable partial-credit assignment across all metrics on both datasets.
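The abstract describes a two-step workflow: first derive a problem-specific rubric from a reference solution, then grade the candidate proof against that rubric. Below is a minimal sketch of that loop, assuming the `google-genai` Python client and the 0–7 MathArena scale; the prompts, function names, and the `Score: <n>` output convention are illustrative assumptions, not the paper's actual implementation.

```python
"""Minimal sketch of a rubric-then-grade loop in the spirit of RefGrader.

Assumptions (not from the paper): the google-genai client, the exact
prompts, and the 'Score: <n>' output convention are all illustrative.
"""
import re

from google import genai

client = genai.Client()  # expects GOOGLE_API_KEY in the environment


def call_llm(prompt: str) -> str:
    """Single call to the grader model (Gemini 2.5 Pro)."""
    resp = client.models.generate_content(model="gemini-2.5-pro", contents=prompt)
    return resp.text


def derive_rubric(problem: str, reference_solution: str) -> str:
    """Step 1: turn a reference solution into a problem-specific rubric."""
    return call_llm(
        "You are grading Olympiad proofs. From the problem and reference "
        "solution below, write a rubric that splits the full score of 7 "
        "points into checkable milestones, each with a point value.\n\n"
        f"Problem:\n{problem}\n\nReference solution:\n{reference_solution}"
    )


def grade_with_rubric(problem: str, rubric: str, candidate: str) -> int:
    """Step 2: grade a candidate proof against the derived rubric."""
    report = call_llm(
        "Grade the candidate proof against the rubric. For each milestone, "
        "say whether it is achieved, flag any errors with their severity, "
        "and end with a line of the form 'Score: <integer 0-7>'.\n\n"
        f"Problem:\n{problem}\n\nRubric:\n{rubric}\n\n"
        f"Candidate proof:\n{candidate}"
    )
    match = re.search(r"Score:\s*(\d+)", report)
    return int(match.group(1)) if match else 0


def refgrader(problem: str, reference_solution: str, candidate: str) -> int:
    rubric = derive_rubric(problem, reference_solution)
    return grade_with_rubric(problem, rubric, candidate)
```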
Submission Number: 260