HighMATH: Evaluating Math Reasoning of Large Language Models in Breadth and Depth

ACL ARR 2025 February Submission6510 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: With the rapid development of large language models (LLMs) in math reasoning, the accuracy of models on existing math benchmarks has gradually approached or exceeded 90%. More challenging math benchmarks are hence urgently needed to meet the growing evaluation demands. To bridge this gap, we propose HighMATH. Problems in HighMATH are collected according to 3 criteria: problem complexity, knowledge domain diversity, and fine-grained annotations. We collect 5,293 problems from Chinese senior high school mathematics exams published in 2024, covering 8 subjects and 7 levels of difficulty, with each problem involving an average of more than 2.4 knowledge points. We conduct a thorough evaluation of the latest LLMs on the curated HighMATH, including o1-like models. Evaluation results demonstrate that the accuracy of advanced LLMs on HighMATH is significantly lower than that on previous math reasoning benchmarks. Even with the use of majority voting and a Python executor, the highest accuracy does not exceed 62% on HighMATH. Our results also suggest that properly trained smaller LLMs may have great potential in math reasoning.
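The abstract mentions scoring with majority voting over sampled model answers. As an illustration only (the paper's actual evaluation pipeline is not shown here), the following is a minimal Python sketch of how majority-voted accuracy is typically computed; the function names `majority_vote` and `majority_vote_accuracy` and the example data are hypothetical.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among sampled model outputs.

    Hypothetical helper: assumes answers have already been extracted
    and normalized from the model's generations.
    """
    counts = Counter(a for a in answers if a)
    return counts.most_common(1)[0][0] if counts else ""

def majority_vote_accuracy(samples_per_problem: list[list[str]],
                           references: list[str]) -> float:
    """Fraction of problems whose majority-voted answer matches the reference."""
    correct = sum(
        majority_vote(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return correct / len(references)

# Toy example: two problems, three sampled answers each.
preds = [["12", "12", "13"], ["7", "8", "8"]]
refs = ["12", "7"]
print(majority_vote_accuracy(preds, refs))  # 0.5
```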
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, language resources
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Chinese
Submission Number: 6510