Putnam-AXIOM: A Functional & Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Putnam-AXIOM is a challenging mathematical reasoning benchmark for LLMs, revealing significant reasoning performance gaps and the impact of data contamination.
Abstract: Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some models exceeding 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances, yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview, the strongest evaluated model, scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement final-answer ("boxed") accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates the evaluation of natural-language proofs. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing the advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
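To give a concrete feel for the variation protocol described above (re-sampling constants and variable names so each instance is fresh but comparably difficult, with the ground-truth answer recomputed to match), here is a minimal sketch. The template, field names, and `make_variation` helper are hypothetical stand-ins, not the benchmark's actual generation code, which lives in the linked repository.

```python
# Illustrative sketch of a functional-variation protocol: a problem template whose
# constants are re-sampled so each instance is unseen but equally difficult, with the
# ground-truth ("boxed") answer recomputed for every variant. The template and helper
# names below are hypothetical, not the paper's code.

import random

# Hypothetical template; the real benchmark perturbs actual Putnam problems.
TEMPLATE = "Find the remainder when {a}^{n} is divided by {m}."

def ground_truth(a: int, n: int, m: int) -> int:
    """Recompute the boxed answer for the sampled constants."""
    return pow(a, n, m)

def make_variation(seed: int) -> dict:
    """Sample fresh constants, render the problem text, and attach its answer."""
    rng = random.Random(seed)
    a, n, m = rng.randint(2, 9), rng.randint(10, 99), rng.choice([7, 11, 13])
    return {
        "problem": TEMPLATE.format(a=a, n=n, m=m),
        "answer": ground_truth(a, n, m),
    }

# Each seed yields a new, contamination-free variant of the same underlying task.
for seed in range(3):
    v = make_variation(seed)
    print(v["problem"], "->", v["answer"])
```

Because the answer is derived from the sampled constants rather than stored, the protocol can emit an effectively unlimited stream of held-out instances.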
Lay Summary: Modern AI chatbots can already pass many school-level math tests, so researchers need harder ways to see whether these systems are truly “thinking” or just repeating what they have seen online. We built a new test set called **Putnam-AXIOM** by collecting 522 tough questions from the famous Putnam university math contest. To make sure the bots have not memorized the answers from the internet, we also wrote a computer program that automatically rewrites each problem (changing numbers, variable names, and wording) while keeping the difficulty the same. This lets us generate endless fresh versions that the bots have never encountered. When we ran twenty leading AI models on our test, every one of them scored lower on the rewritten versions than on the originals; the best model’s score fell from 42% to 22%. This gap suggests that today’s systems still rely heavily on memorization. Finally, we introduce a simple scoring method that checks how well a model follows each step of a correct solution, not just the final answer. All data, code, and evaluation tools are publicly released so others can track future progress in genuine mathematical reasoning.
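As intuition for the step-level scoring mentioned above, the following is a minimal sketch assuming Teacher-Forced Accuracy is the fraction of reference-solution tokens a model predicts correctly when conditioned on the gold prefix. The model name, prompt format, and `teacher_forced_accuracy` helper are placeholders; the paper's exact formulation is in the released evaluation code.

```python
# Sketch of a teacher-forced accuracy metric: feed the problem plus the gold
# reference solution through the model and count how many solution tokens the
# model's next-token predictions get right. Assumes a Hugging Face causal LM.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in the model under evaluation

def teacher_forced_accuracy(problem: str, reference_solution: str) -> float:
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.eval()

    prompt_ids = tok(problem, return_tensors="pt").input_ids
    solution_ids = tok(reference_solution, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, solution_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)

    # In the next-token setup, logits at position t-1 predict the token at position t.
    preds = logits[:, :-1, :].argmax(dim=-1)
    targets = input_ids[:, 1:]

    # Score only the tokens of the reference solution, not the prompt.
    sol_len = solution_ids.shape[1]
    correct = (preds[:, -sol_len:] == targets[:, -sol_len:]).float()
    return correct.mean().item()

print(teacher_forced_accuracy(
    "Problem: Compute 2 + 2. Solution:", " The answer is 4."
))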
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/brando90/putnam-axiom
Primary Area: Deep Learning->Robustness
Keywords: Benchmarks, Large Language Models, Mathematical Reasoning, Mathematics, Reasoning, Machine Learning
Submission Number: 13558