Keywords: benchmark, mathematical reasoning, Large Language Models, agentic evaluations
TL;DR: We introduce IMProofBench, a peer-reviewed, tool-augmented, multi-turn benchmark of 39 research-level math problems that combines human and automated grading to assess LLM proof writing.
Abstract: As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge.
However, existing benchmarks are limited because they focus on final-answer questions or high-school competition problems.
To address this, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires an LLM to produce a proof, which is then graded by the problem's author.
Within an evaluation environment equipped with various tools, the best model, GPT-5, solves 22% of the problems, closely followed by Grok-4 at 19%.
Importantly, an analysis of our results indicates that current LLMs can aid research mathematicians at a basic level, but still require significant supervision to avoid simple mistakes. As LLMs continue to improve, IMProofBench will evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.
Submission Number: 200