HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

James V Roggeveen; Erik Y. Wang; David Ettel; Will Flintoft; Peter Donets; Raglan Ward; Ahmed Roman; Anton Marius Graf; Siddharth Dandavate; Ava Williamson; Felix Yeung; Kacper K Migacz; Yijun Wang; Egemen Bostan; Duy Thuc Nguyen; Zhe He; Marc L. Descoteaux; Anne Mykland; Shida Liu; Jorge García Ponce; Luke Zhu; Yuyang Chen; Ekaterina S. Ivshina; Miguel Fernandez; Minjae Kim; Kennan Gumbs; Matthew Scott Tan; Russell Yang; Mai Hoang; David Brown; Isabella A Silveira; Lavon Sykes; Arjun Nageswaran; William Fredenberg; Yiming Chen; Lucas Martin; Yixing Tang; Kelly Werker Smith; Hongyu Liao; Logan G. Wilson; Alexander Dazhen Cai; Lucy S. Nathwani; Nickholas Gutierrez; Andrea Elizabeth Biju; Michael Brenner

Published: 18 Sept 2025, Last Modified: 30 Oct 2025

Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster)

License: CC BY 4.0

Keywords: math, benchmark, dataset, AI for education

TL;DR: We introduce a challenging benchmark of graduate-level problems in applied mathematics that was fully developed as part of a university class.

Abstract: Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking the approximation-based problems that are ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present $\textbf{HARDMath2}$, a dataset of 211 original problems covering the core topics of an introductory graduate applied mathematics class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. The dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies for creating increasingly difficult problems by interacting with the models and exploiting their common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age in which state-of-the-art language models can solve many challenging problems across a wide range of fields.
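To make the idea of checking an approximate answer against a numerical ground truth concrete, here is a minimal Python sketch. It is not the authors' evaluation code, and the integral is an illustrative choice rather than a problem from the dataset. It assumes a candidate answer for the large-$x$ behavior of $I(x) = \int_0^1 e^{-xt}/(1+t)\,dt$, obtained by expanding $1/(1+t)$ about $t = 0$ and integrating term by term (Watson's lemma), and compares it with a Simpson's-rule evaluation of the integral. The function names, the test values of $x$, and the 1% tolerance are all assumptions made for this example.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd interior nodes get weight 4, even interior nodes weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def numeric_integral(x):
    """Numerical ground truth for I(x) = int_0^1 exp(-x t) / (1 + t) dt."""
    return simpson(lambda t: math.exp(-x * t) / (1.0 + t), 0.0, 1.0)


def asymptotic_candidate(x):
    """Three-term large-x approximation: 1/x - 1/x^2 + 2/x^3 (illustrative)."""
    return 1 / x - 1 / x**2 + 2 / x**3


def relative_error(candidate, x):
    """Relative difference between a candidate answer and the numerical value."""
    truth = numeric_integral(x)
    return abs(candidate(x) - truth) / abs(truth)


def is_consistent(candidate, x_values, tol=0.01):
    """Accept the candidate if it is within tol of the numerics at every x."""
    return all(relative_error(candidate, x) <= tol for x in x_values)


if __name__ == "__main__":
    xs = [20.0, 40.0, 80.0]  # hypothetical test points in the large-x regime
    for x in xs:
        print(x, relative_error(asymptotic_candidate, x))
    print("accepted:", is_consistent(asymptotic_candidate, xs))
```

A check of this kind can only confirm agreement in the regime that is sampled, so the choice of test points and tolerance matters: too loose a tolerance would accept an answer with a wrong subleading term, while too tight a tolerance would reject a correct leading-order answer.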

Croissant File: json

Dataset URL: https://huggingface.co/datasets/JVRoggeveen/HARDMath2

Code URL: https://github.com/JamesRoggeveen/hardmath2_eval

Primary Area: Evaluation (e.g., data collection methodology, data processing methodology, data analysis methodology, meta studies on data sources, extracting signals from data, replicability of data collection and data analysis and validity of metrics, validity of data collection experiments, human-in-the-loop for data collection, human-in-the-loop for data evaluation)

Submission Number: 2099
