QEDBench: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
Abstract: As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early-graduate-level mathematics. To quantify this gap, we introduce QEDBench, the first benchmark to systematically measure alignment with human experts on undergraduate-level math proofs by contrasting course-specific rubrics against expert common-knowledge criteria. By deploying a dual-evaluation matrix ($7$ judges $\times$ $5$ solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators, such as Claude 4.5 Opus, exhibit significant positive bias (up to $+0.28$ mean score inflation), effectively "hallucinating rigor" in flawed proofs. Furthermore, we uncover a critical reasoning disparity: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 raw score), specialized reasoning models such as o3-deep-research collapse in discrete domains, dropping to 42.1\% accuracy in Graph Theory. We release QEDBench as a public benchmark for evaluating and improving AI judges.
Lay Summary: AI systems can now write impressive-looking math proofs, but it is still unclear whether other AI systems can reliably judge whether those proofs are correct. This paper introduces QEDBench, a testbed for checking AI graders on university-level math proofs. The benchmark uses 272 expert-curated proof problems, more than 1,300 AI-generated proofs, seven AI judge models, five AI solver models, and over 1,000 hours of evaluation by PhD-level experts. The study finds that many AI judges are too generous: they often give high scores to proofs that look polished but contain serious logical mistakes. One model, Llama 4 Maverick, passed 90.2% of solutions, while human experts passed only 67.7%. The hardest failures appear in areas such as combinatorics and graph theory, where success requires building a careful argument rather than following a familiar recipe. The paper also shows that stricter written rubrics do not fully solve the problem, because AI judges often keep their built-in grading habits. Overall, QEDBench argues that progress in AI mathematics needs better proof-checking, not just better proof-writing, especially before automated graders are trusted in education or research.
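The two headline quantities above, mean score inflation and the gap between judge and human pass rates, can be illustrated with a minimal sketch. This is not the QEDBench implementation: the function names, the 0-to-1 score scale, the pass threshold of 0.5, and the toy scores are all assumptions made for illustration, and the paper's actual scoring scheme may differ.

```python
from statistics import mean


def mean_score_inflation(judge_scores, human_scores):
    """Mean signed difference (judge minus human) over paired proofs.

    A positive value means the judge scores proofs higher than the
    human experts do on average. Hypothetical helper, not from the paper.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("scores must be paired one-to-one")
    return mean(j - h for j, h in zip(judge_scores, human_scores))


def pass_rate(scores, threshold):
    """Fraction of proofs whose score meets the (assumed) pass threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy data on an assumed 0-to-1 scale; NOT taken from the benchmark.
human = [1.0, 0.5, 0.0, 0.75]
judge = [1.0, 1.0, 0.5, 0.75]

THRESHOLD = 0.5  # assumed pass cutoff, chosen only for this example

inflation = mean_score_inflation(judge, human)
gap = pass_rate(judge, THRESHOLD) - pass_rate(human, THRESHOLD)

print(f"mean score inflation: {inflation:+.2f}")
print(f"pass-rate gap: {gap:+.2%}")
```

Applied to the pass rates reported above, the same subtraction gives a gap of 90.2% - 67.7% = 22.5 percentage points for Llama 4 Maverick relative to the human experts.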
Link To Code: https://github.com/qqliu/Yale-QEDBench
Primary Area: Deep Learning->Large Language Models
Keywords: math proof benchmark, proof verification, LLMs, evaluation-human-llm-alignment
Submission Number: 29870