A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems

Published: 01 Mar 2025, Last Modified: 01 Mar 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: Using AI to write formal proofs for mathematical problems is a challenging task that has seen some advancements in recent years. Automated systems such as Lean can verify the correctness of proofs written in formal language, yet writing such proofs remains difficult for both humans and machines. The miniF2F benchmark has 20 IMO problems in its test set, yet formal proofs are available for only 6 of these problems (3 of which were written only by mathematicians). The best-performing model can prove only 2 of these 20 IMO problems, both from the 1950s and 60s, and its training set is not public. In this work, we write complete, original formal proofs in Lean for the remaining IMO problems, along with 3 extra problems from IMO 2022 and 2023. This effort expands the formal proofs currently available in the public domain by 5,880 lines of Lean. The goal of this paper is to pave the way for AI models that can automatically write formal proofs for all the IMO problems in miniF2F and beyond, by providing an evaluation benchmark. In this pursuit, we devise a method to decompose the proofs of these problems into their building blocks, constructing a dataset of 1,329 lemmas comprising more than 40k lines of Lean code. These lemmas are not trivial, yet they are approachable, providing the opportunity to evaluate and diagnose the failures and successes of AI models. We evaluate the ability of SOTA LLMs on our dataset and analyze their success and failure modes from different perspectives. Our dataset and code are available at: https://github.com/roozbeh-yz/IMO-Steps.
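To make the idea of proof decomposition concrete, the following is a hedged sketch of the kind of small, self-contained building-block lemma the abstract describes. The statement and names here are illustrative only and are not taken from the dataset; it assumes Lean 4 with Mathlib.

```lean
import Mathlib.Tactic

-- Illustrative example (not from the dataset): a short, approachable
-- lemma of the sort a full IMO proof might be decomposed into.
theorem sum_of_squares_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have ha : 0 ≤ a ^ 2 := sq_nonneg a   -- each square is nonnegative
  have hb : 0 ≤ b ^ 2 := sq_nonneg b
  linarith                              -- combine the two facts linearly
```

A lemma at this granularity can be checked independently by Lean, which is what makes such building blocks useful for diagnosing where an AI prover succeeds or fails.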
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear Action Editor, We have now submitted the camera-ready version of our paper. We have added a new co-author, as discussed previously, and addressed all the feedback in your comments. Here is a list of changes in our final revision:
1. We extended our evaluation to 6 new models, as presented in Tables 5, 6, and E4. As you requested, we evaluated the accuracy of Qwen Math 2.5 and Llama 3.1 on our dataset (Table E4). We also evaluated Goedel Prover, DeepSeek Prover 1.5, and ReProver. Goedel Prover was released this February with SOTA accuracy on miniF2F. DeepSeek Prover was the previous SOTA model until this February and turns out to have the best accuracy (39%) on our dataset. Additionally, we updated our experiments from o1-mini to o3-mini. We also experimented with ReProver, an established model for automated theorem proving with an open-source training set.
2. We revised our introduction to reflect our discussions with the reviewers and to explain how we suggest our dataset be used.
3. We added further explanation examining the length of the proofs that these LLMs can correctly prove.
Thank you again for your helpful feedback. Sincerely, The Authors
Code: https://github.com/roozbeh-yz/IMO-Steps
Assigned Action Editor: ~Shay_B_Cohen1
Submission Number: 3803