Measuring Progress in Reasoning Toward Mathematical Discovery with Automatic Verification

Erik Y. Wang; Sumeet Ramesh Motwani; James V Roggeveen; Eliot Hodges; Dulhan Jayalath; Charles London; Kalyan Ramakrishnan; Cheng Zhang; Flaviu Cipcigan; Philip Torr; Alessandro Abate

Published: 17 Jun 2026, Last Modified: 23 Jun 2026

Venue: ICML 2026 AI4Math Workshop (Oral)

License: CC BY 4.0

Code Url: https://github.com/ewang26/HorizonMath

Keywords: AI for math, evaluation, benchmark

TL;DR: We introduce an open-source evaluation consisting of 100+ unsolved research problems in mathematics with automated deterministic verifiers, already leading to three verified novel discoveries.

Abstract: Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of 113 predominantly unsolved problems spanning eight domains in mathematics and the mathematical sciences, paired with an open-source evaluation framework for automated verification. Our benchmark targets the generator-verifier gap: problems where discovery is hard and requires meaningful mathematical insight, but verification is computationally straightforward. This contrasts with most existing research-level benchmarks, which instead rely on formal proof verification or manual review, both of which are expensive to scale. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Using this framework, we find three research problems for which GPT 5.4 Pro proposes novel solutions that either resolve previously open questions or improve on the best-known published results. Across six frontier models, reasoning efficiency and behavior also vary significantly. We release HorizonMath as an open challenge and a growing community resource, where each verified solution is a candidate contribution to the mathematical literature.
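Illustration: the generator-verifier gap described in the abstract can be made concrete with a toy problem. The sketch below is not taken from the HorizonMath code base; the function names `is_golomb_ruler` and `verify_candidate` and the choice of Golomb rulers as the example are assumptions made purely for illustration. Finding short Golomb rulers is hard in general, while checking a proposed one only requires comparing pairwise differences, so a deterministic verifier can accept or reject a candidate and compare it against a best-known baseline.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """Return True if all pairwise differences between marks are distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))


def verify_candidate(marks, best_known_length):
    """Deterministically check a candidate against a baseline (hypothetical helper).

    Returns (valid, improves):
      valid    -- the marks are distinct and form a Golomb ruler
      improves -- the ruler is valid and strictly shorter than the baseline
    """
    if len(set(marks)) != len(marks) or not is_golomb_ruler(marks):
        return False, False
    length = max(marks) - min(marks)
    return True, length < best_known_length


# A valid 4-mark ruler of length 6, compared with a baseline length of 7.
print(verify_candidate([0, 1, 4, 6], best_known_length=7))
# An invalid candidate: the difference 1 occurs twice (1-0 and 2-1).
print(verify_candidate([0, 1, 2, 4], best_known_length=7))
```

The verifier runs in time quadratic in the number of marks, whereas the search for a better ruler has no comparably cheap procedure; that asymmetry is the kind of gap the benchmark is said to target.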

Submission Number: 197
