QEDBench: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Santiago Gonzalez; Alireza Amiri Bavandpour; Peter Ye; Edward Zhang; Ruslans Aleksejevs; Todor Antić; Polina Baron; Sujeet Bhalerao; Shubhrajit Bhattacharya; Zachary Burton; John Byrne; Hyungjun Choi; Nujhat Ahmed Disha; Koppány István Encz; Yuchen Fang; Robert Joseph George; Ebrahim Ghorbani; Alan Goldfarb; Jing Guo; Meghal Gupta; Stefano Huber; Annika Kanckos; Minjung Kang; Hyun Jong Kim; Dino Lorenzini; Levi Lorenzo; Tianyi Mao; Giovanni Marzenta; Ariane M. Masuda; Lukas Mauth; Ana Mickovic; Andrés Miniguano-Trujillo; Antoine Moulin; Wenqi Ni; Tomos Parry; Kevin Ren; Hossein Roodbarani; Mathieu Rundström; Manjil Saikia; Detchat Samart; Rebecca Steiner; Connor Stewart; Dhara Thakkar; Jeffrey Tse; Vasiliki Velona; Yunhai Xiang; Sibel Yalçın; Jun Yan; Ji Zeng; Arman Cohan; Quanquan C. Liu


Published: 30 Apr 2026, Last Modified: 24 Jun 2026

Venue: ICML 2026 regular

License: CC BY-NC-ND 4.0

Abstract: As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic evaluation Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first benchmark to systematically measure alignment with human experts on undergraduate-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix ($7$ judges $\times$ $5$ solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude 4.5 Opus exhibit significant positive bias (up to $+0.28$ mean score inflation), effectively "hallucinating rigor" in flawed proofs. Furthermore, we uncover a critical reasoning disparity: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 raw score), specialized reasoning models like o3-deep-research collapse in discrete domains, dropping to 42.1\% accuracy in Graph Theory. We release QEDBench as a public benchmark for evaluating and improving AI judges.
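To make the "mean score inflation" figure concrete, the following is a minimal sketch of one way such a bias could be computed: the mean signed difference between a judge's score and the human expert's score on the same proofs. The function name `mean_score_bias`, the 0-to-1 score scale, and the sample values are assumptions for illustration only; they are not taken from the paper or its released code.

```python
from statistics import mean


def mean_score_bias(judge_scores, human_scores):
    """Mean signed difference (judge minus human) over paired proofs.

    A positive value means the judge scores higher than the human
    expert on average, i.e. the judge inflates scores.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("scores must be paired one-to-one")
    return mean(j - h for j, h in zip(judge_scores, human_scores))


# Hypothetical scores for four proofs (illustrative, not from QEDBench).
judge = [1.0, 0.75, 1.0, 0.5]
human = [0.75, 0.5, 1.0, 0.25]

bias = mean_score_bias(judge, human)
print(f"mean score bias: {bias:+.4f}")
```

Under this reading, a reported bias of $+0.28$ would mean the judge's scores sit 0.28 above the expert's on average, on whatever scale the benchmark uses.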

Lay Summary: AI systems can now write impressive-looking math proofs, but it is still unclear whether other AI systems can reliably judge whether those proofs are correct. This paper introduces QEDBench, a testbed for checking AI graders on university-level math proofs. The benchmark uses 272 expert-curated proof problems, more than 1,300 AI-generated proofs, seven AI judge models, five AI solver models, and over 1,000 hours of evaluation by PhD-level experts. The study finds that many AI judges are too generous: they often give high scores to proofs that look polished but contain serious logical mistakes. One model, Llama 4 Maverick, passed 90.2% of solutions, while human experts passed only 67.7%. The hardest failures appear in areas such as combinatorics and graph theory, where success requires building a careful argument rather than following a familiar recipe. The paper also shows that stricter written rubrics do not fully solve the problem, because AI judges often keep their built-in grading habits. Overall, QEDBench argues that progress in AI mathematics needs better proof-checking, not just better proof-writing, especially before automated graders are trusted in education or research.
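The two pass rates quoted above can be compared directly. The sketch below only restates that arithmetic; the helper name `pass_rate_gap` is hypothetical, and the two percentages are the ones given in the lay summary.

```python
def pass_rate_gap(judge_pass_rate, human_pass_rate):
    """Difference in pass rates, in percentage points (judge minus human)."""
    return judge_pass_rate - human_pass_rate


# Figures from the lay summary: Llama 4 Maverick vs. human experts.
gap = pass_rate_gap(90.2, 67.7)
print(f"{gap:.1f} percentage points")
```

A positive gap indicates that the AI judge accepts more solutions than the human experts do.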

Link To Code: https://github.com/qqliu/Yale-QEDBench

Primary Area: Deep Learning->Large Language Models

Keywords: math proof benchmark, proof verification, LLMs, evaluation-human-llm-alignment


Submission Number: 29870
