Keywords: math, benchmark, dataset, reasoning
TL;DR: We introduce a new dataset of difficult graduate-level applied mathematics problems; our evaluations show that leading LLMs achieve low accuracy on these problems.
Abstract: Advanced applied mathematics problems are underrepresented in the existing benchmark datasets used to evaluate Large Language Models (LLMs). To address this, we introduce **HARDMath**, the Harvard Approximate Reasoning Dataset for Mathematics: a dataset of 1,466 difficult problems inspired by Harvard University’s graduate course on asymptotic methods. The dataset contains a diverse set of challenging applied mathematics problems with worked solutions that employ various analytical approximation methods. Developing such solutions typically requires multiple modes of analysis, including mathematical reasoning, the use of computational tools, and subjective judgment, making this a challenging problem for LLMs. We establish a framework that auto-generates an arbitrarily large number of ‘hard’ applied mathematics problems with approximate analytical solutions, each validated against a numerical ground truth. We evaluate frontier LLMs on **HARDMath-mini**, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models perform significantly worse than they do on existing mathematics benchmark datasets. We additionally conduct a detailed error analysis to gain insight into the failure modes of LLMs. These results demonstrate the limitations of current LLMs on advanced graduate-level asymptotic mathematics problems and underscore the importance of datasets like **HARDMath** for advancing the mathematical abilities of LLMs.
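To illustrate the kind of validity check the abstract describes, below is a minimal sketch (not the authors' code) of how an approximate analytical solution can be compared against a numerical ground truth. The example problem, the leading-order asymptotic formula, and the acceptance tolerance are all illustrative assumptions, not taken from the HARDMath pipeline.

```python
# Sketch: accept an asymptotic (approximate analytical) answer only if it
# agrees with a numerical ground truth to within a chosen tolerance.
import numpy as np
from scipy.integrate import quad

def numerical_ground_truth(x):
    """Numerically evaluate I(x) = ∫_0^∞ e^{-x t} / (1 + t) dt."""
    value, _ = quad(lambda t: np.exp(-x * t) / (1.0 + t), 0.0, np.inf)
    return value

def leading_order_asymptotic(x):
    """Leading-order approximation I(x) ~ 1/x as x → ∞ (integration by parts)."""
    return 1.0 / x

x = 50.0  # regime where the asymptotic expansion is expected to hold
exact = numerical_ground_truth(x)
approx = leading_order_asymptotic(x)
rel_error = abs(approx - exact) / abs(exact)

# Hypothetical acceptance criterion: keep the auto-generated solution only
# if it matches the numerics to within 5% relative error.
assert rel_error < 0.05, f"approximation rejected: relative error {rel_error:.3f}"
print(f"I({x:g}) ≈ {approx:.6f} (numerical: {exact:.6f}, rel. error {rel_error:.2%})")
```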
Concurrent Submissions: NeurIPS 2024 Track Datasets and Benchmarks
Submission Number: 30