Putnam-AXIOM: A Functional and Static Benchmark

Aryan Gulati; Brando Miranda; Eric Chen; Emily Xia; Kai Fronsdal; Bruno de Moraes Dumont; Sanmi Koyejo

Putnam-AXIOM: A Functional and Static Benchmark

Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno de Moraes Dumont, Sanmi Koyejo

Published: 09 Jul 2025, Last Modified: 16 Jul 2025AI4Math@ICML25 PosterEveryoneRevisionsBibTeXCC BY-NC-SA 4.0

Keywords: Benchmarks, Large Language Models, Mathematical Reasoning, Mathematics, Reasoning, Machine Learning

TL;DR: Putnam-AXIOM is a challenging mathematical reasoning benchmark for LLMs, revealing significant reasoning performance gaps and the impact of data contamination.

Abstract: Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables, and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview – the strongest evaluated model – scores 41.9%, but its accuracy drops by 19.6 % (46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement ("boxed") accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://anonymous.4open.science/r/putnam-axiom-8849.

Submission Number: 142

Loading