Keywords: Benchmarks, Large Language Models, Mathematical Reasoning, Mathematics, Reasoning, Machine Learning
TL;DR: Putnam-AXIOM is a challenging mathematical reasoning benchmark for LLMs, revealing significant reasoning performance gaps and the impact of data contamination.
Abstract: As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated. Therefore, we present the Putnam-AXIOM Original benchmark, consisting of 236 mathematical problems from the William Lowell Putnam Mathematical Competition, along with detailed step-by-step solutions. To preserve the Putnam-AXIOM benchmark's validity and mitigate potential data contamination, we created the Putnam-AXIOM Variation benchmark with functional variations of 52 problems. By programmatically altering problem elements such as variables and constants, we can generate unlimited novel, equally challenging problems not found online. Almost all models achieve significantly lower accuracy on the variations than on the original problems. Our results reveal that OpenAI's o1-preview, the best-performing model, achieves merely 41.95% accuracy on Putnam-AXIOM Original but suffers around a 30% reduction in accuracy on the Variation dataset compared to the corresponding original problems.
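To make the functional-variation idea concrete, here is a minimal, hypothetical Python sketch of how a problem template with randomized constants and a programmatically derived answer might work. The template, function names, and constant ranges are all illustrative assumptions, not the Putnam-AXIOM authors' actual implementation.

```python
import random

# Hypothetical sketch of a "functional variation": the problem statement is a
# template whose constants are randomized, and the answer is recomputed from
# the new constants, so each instance is novel but equally challenging.
# (Illustrative only; not the benchmark's actual variation pipeline.)

TEMPLATE = "Find the minimum value of f(x) = x^2 - {b}x + {c} over the reals."

def make_variation(rng: random.Random) -> dict:
    """Instantiate the template with fresh constants and a derived answer."""
    b = 2 * rng.randint(1, 10)   # even, so the vertex value is an integer
    c = rng.randint(1, 50)
    # The minimum of x^2 - bx + c occurs at x = b/2, with value c - b^2/4.
    answer = c - (b * b) // 4
    return {"problem": TEMPLATE.format(b=b, c=c), "answer": answer}

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(3):
        variation = make_variation(rng)
        print(variation["problem"], "->", variation["answer"])
```

Because the answer is computed from the sampled constants rather than stored, no generated instance can have been memorized verbatim from the web, which is the contamination-mitigation property the abstract describes.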
Concurrent Submissions: ICLR 2025
Submission Number: 86