Keywords: benchmark, mathematical reasoning, Large Language Models, agentic evaluations
TL;DR: We introduce IMProofBench, a peer-reviewed, tool-augmented, multi-turn benchmark of 39 research-level math problems that combines human and automated grading to assess LLM proof writing.
Abstract: As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge.
However, existing benchmarks are limited because they focus on final-answer questions or high-school competition problems.
To address this, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires an LLM to produce a proof, which is then graded by the problem's author.
Within an evaluation environment equipped with various tools, the best model, GPT-5, solves 22% of the problems, closely followed by Grok-4 at 19%.
Importantly, an analysis of our results indicates that current LLMs can aid research mathematicians at a basic level, but still require significant supervision to avoid simple mistakes. As LLMs continue to improve, IMProofBench will evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.
Submission Number: 200