The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

Published: 26 Jan 2026, Last Modified: 11 Apr 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: ai, artificial intelligence, reasoning, llm, math, benchmark, dataset, proof, gpt, machine learning
TL;DR: We conduct the largest human-based evaluation of frontier Large Reasoning Models on challenging mathematical proofs and answer multiple open questions in the field.
Abstract: In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and for addressing key open questions in the field of automated proof generation. Specifically, it remains unknown (1) how large the gap is between natural language and formal proof generation, (2) how final-answer accuracy relates to full proof correctness, and (3) how best-of-n selection strategies affect proof quality. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research, and it is the first large dataset of LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we address the open questions outlined above and provide new insights into LLMs' strengths and limitations in mathematical reasoning. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that matches Gemini-2.5-Pro and performs close to the best model, GPT-5, at evaluating proof correctness.
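The best-of-n selection question raised in the abstract can be illustrated with a minimal sketch: generate n candidate proofs, score each with a judge, and keep the highest-scoring one. The judge below is a hypothetical keyword heuristic standing in for an LLM grader; it is not the paper's actual pipeline.

```python
def judge_score(proof: str) -> float:
    """Hypothetical stand-in for an LLM judge that scores proof
    correctness in [0, 1]; here a crude keyword heuristic for
    illustration only."""
    return 1.0 if "therefore" in proof.lower() else 0.0


def best_of_n(candidates: list[str]) -> str:
    """Best-of-n selection: return the candidate with the highest
    judge score."""
    return max(candidates, key=judge_score)


# Toy candidate proofs for the same problem.
proofs = [
    "We assume the claim and stop.",
    "By induction on n, therefore the claim holds.",
]
print(best_of_n(proofs))
```

In practice the judge would be a model (such as the finetuned 8B evaluator the paper describes), and selection quality depends on how well the judge's scores track human-evaluated correctness.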
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 4513