QEDBench: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Published: 01 May 2026, Last Modified: 26 May 2026OpenReview Archive Direct UploadEveryoneCC BY-NC-ND 4.0

Abstract: As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic evaluation Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first benchmark to systematically measure alignment with human experts on undergraduate-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix ( judges solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude 4.5 Opus exhibit significant positive bias (up to mean score inflation), effectively "hallucinating rigor" in flawed proofs. Furthermore, we uncover a critical reasoning disparity: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 raw score), specialized reasoning models like o3-deep-research collapse in discrete domains, dropping to 42.1% accuracy in Graph Theory. We release QEDBench as a public benchmark for evaluating and improving AI judges.