# Open Proof Corpus (OPC)

We introduce the Open Proof Corpus (OPC), the world's first large-scale, open-source dataset of human-verified solutions to advanced mathematics problems. With over 5,000 solutions spanning 1,000+ challenging problems, the OPC is designed for broad applicability and downstream use in proof generation research. It is also the first dataset to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO.

Leverage the OPC to tackle pressing research questions in automated proof generation:

1. How do natural language and formal proof generation compare?
2. How often do models that produce correct final answers truly reason their way to valid proofs?
3. By what margin do best-of-n selection methods improve proof quality?

Building on this resource, we present **OPC-R1-8B**, an open-source model for judging proof correctness that matches state-of-the-art performance. OPC-R1-8B outperforms the majority of leading closed-source models, reaching 88.1% accuracy on verifying LLM-generated proofs.

## Dataset Statistics

We include 5,000+ human-verified samples covering 1,000+ problems from over 20 national and international math competitions. Each proof is labeled as either correct or incorrect by one or two human judges. Labels are accompanied by short justifications, with optional annotations highlighting specific sentences within the proof. We include problems of varying difficulty, as well as a large portion of the USAMO, IMO and Balkan MO Shortlists (see full distribution below). Note that our MathArena results include statistics covering the 2025 SMT competition; we will make this data public once the official source becomes available.

## OPC-R1-8B
We release our judging model, **OPC-R1-8B**, which **matches Gemini 2.5 Pro** on our LLM-as-a-Judge evaluation.

<div align="center">

|            **Model**            |                          **Download**                         |
| :-----------------------------: | :----------------------------------------------------------: |
|  OPC-R1-8B  | [🤗 HuggingFace](Anonymized Link) |
</div>
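
As a rough illustration of what judge accuracy means here, the sketch below compares a judge model's correct/incorrect verdicts against human labels and reports the fraction that agree. The function name `judge_accuracy` and the toy labels are hypothetical, and the actual evaluation protocol may differ in details such as how uncertain or doubly graded samples are handled.

```python
def judge_accuracy(human_labels, judge_verdicts):
    """Fraction of proofs where the judge's verdict equals the human label."""
    if len(human_labels) != len(judge_verdicts):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled proof")
    matches = sum(h == j for h, j in zip(human_labels, judge_verdicts))
    return matches / len(human_labels)


# Toy data (made up): True means "proof is correct".
human = [True, False, True, True]
model = [True, False, False, True]
acc = judge_accuracy(human, model)  # 3 of the 4 verdicts agree
```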

## Dataset Description

The OPC contains a rich variety of information and is divided into several splits, depending on the evaluation task.

### Columns

- **problem_id** - A unique identifier for each problem statement, including the competition name and year. The problem number presented in the id *might not correspond exactly to the problem number in the actual competition*.
- **problem** - The problem statement.
- **solution** - The model proof.
- **score** - The scores assigned by our judge(s).
- **feedback** - The judge's explanation for their score.
- **model_id** - The identifier for the solver model. (NOTE: Some of our best-of-n results include V1 of our Ranking Judge, which uses a suboptimal ranking methodology.)
- **uncertain** - Whether the judge marked their grade as uncertain.
- **annotations** (Optional) - Comments on the solution from the judge.
- **judge_id** - The identifier of the judge that graded the solution.
- **competition** - The competition from which the problem was taken.
- **category** (Optional) - A list containing the problem categories the problem belongs to (Algebra, Combinatorics, Geometry, Number Theory).
- **level** - Whether the competition is high-school- or undergraduate-level.
- **source** - Our main source for the problems. This is usually the official competition website or prior work in mathematical reasoning (e.g., MathArena, PutnamBench).
- **url** - A link to the source of the data (when available).
- **year** - The year the competition was held in.
- **split** - The split the entry belongs in (further explained below). 
- **problem_issue** (Optional) - Any indicated issue with the problem statement.
- **ground_truth_solution** (Optional) - The author solution (if one is provided).
- **ground_truth_issue** (Optional) - Any indicated issue with the author solution.
- **ground_truth_note** (Optional) - Comments on the author solutions, provided to our judges.
- **thinking** - The contents of the thinking block (for open-source models).
- **cost** - The cost for generating the solution in USD.
- **output_tokens** - The reported number of output tokens used to calculate the cost.
- **input_tokens** - The number of input tokens used to calculate the cost.
- **timestamp** - The time of the last update made to the score or feedback.
- **llm_summary** - A summary of the model solution, provided to our judges. Generated by o4-mini.
- **llm_issues** - A list of potential solution issues, provided to our judges. Generated by o4-mini.
- **other_selectors** - For best-of-n evaluations - any judges that also selected the given solution.
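
The columns above can be combined in simple analyses. The following is a minimal sketch that computes, per solver model, the fraction of graded proofs judged correct while skipping samples flagged as `uncertain`. The rows are made up for illustration, and two points are assumptions rather than documented facts: that `score` is a list with one 0/1 entry per judge, and that a proof counts as correct only when every judge scored it 1.

```python
from collections import defaultdict

# Hypothetical rows that mimic a few of the columns listed above; the values
# are invented for illustration and are not taken from the OPC.
rows = [
    {"problem_id": "USAMO_2025_1", "model_id": "model-a", "score": [1], "uncertain": False},
    {"problem_id": "USAMO_2025_1", "model_id": "model-b", "score": [0], "uncertain": False},
    {"problem_id": "IMO_2024_2", "model_id": "model-a", "score": [1, 1], "uncertain": False},
    {"problem_id": "IMO_2024_2", "model_id": "model-b", "score": [1, 1], "uncertain": True},
]


def is_correct(row):
    """Assumption: a proof is correct only if every judge gave it a 1."""
    return all(s == 1 for s in row["score"])


def correctness_by_model(rows, skip_uncertain=True):
    """Return {model_id: fraction of graded proofs judged correct}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in rows:
        if skip_uncertain and row["uncertain"]:
            continue  # drop grades the judge was unsure about
        totals[row["model_id"]] += 1
        correct[row["model_id"]] += is_correct(row)
    return {m: correct[m] / totals[m] for m in totals}


rates = correctness_by_model(rows)
rates_all = correctness_by_model(rows, skip_uncertain=False)
```

Whether to drop uncertain grades or how to combine two judges' scores is a design choice; the strict "all judges agree" rule used here is only one option.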

### Splits

We split our data into six sections:

 - **Generic** - The largest part of the OPC, used for evaluating proof generation capabilities and for training **OPC-R1-8B**.
 - **Test** - Designated for our LLM-as-a-Judge evaluation.
 - **PutnamBench** - Informal subset of PutnamBench for evaluating the gap between formal and informal reasoning.
 - **MathArena** - Final-answer problems from the MathArena benchmark.
 - **Best-of-n** - Solutions selected by best-of-n selection methods.
 - **Pass-at-n** - The subset from which we derive our pass@8 metrics for best-of-n selection methods.
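
To make the role of the `split` column concrete, here is a minimal sketch of an empirical pass@n computation: a problem counts as solved if at least one of its sampled proofs in the chosen split is correct. The split label `"pass_at_n"`, the 0/1 encoding of `score`, and the toy rows are assumptions for this example; the stored split names and the exact metric definition used for the reported pass@8 numbers may differ.

```python
from collections import defaultdict

# Hypothetical rows; split labels and scores are invented for illustration.
rows = [
    {"problem_id": "p1", "split": "pass_at_n", "score": [0]},
    {"problem_id": "p1", "split": "pass_at_n", "score": [1]},
    {"problem_id": "p2", "split": "pass_at_n", "score": [0]},
    {"problem_id": "p2", "split": "pass_at_n", "score": [0]},
    {"problem_id": "p3", "split": "generic", "score": [1]},
]


def pass_at_n(rows, split="pass_at_n"):
    """Fraction of problems in `split` with at least one correct sampled proof."""
    solved = defaultdict(bool)
    for row in rows:
        if row["split"] != split:
            continue  # ignore rows that belong to other splits
        correct = all(s == 1 for s in row["score"])
        solved[row["problem_id"]] = solved[row["problem_id"]] or correct
    return sum(solved.values()) / len(solved) if solved else 0.0


result = pass_at_n(rows)  # p1 is solved, p2 is not
```

This is the plain "any of the n samples is correct" rate over the samples present in the data, not the unbiased pass@k estimator that subsamples from a larger pool.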
