MathEval: A Comprehensive Benchmark for Evaluating Large Language Models on Mathematical Reasoning Capabilities

26 Sept 2024 (modified: 23 Jan 2025) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Mathematical reasoning benchmark, Adaptation strategies, Cross-lingual assessment
Abstract: Mathematical reasoning is a fundamental aspect of intelligence, spanning a spectrum from basic arithmetic to intricate problem-solving. Recent investigations into the mathematical abilities of large language models (LLMs) have yielded inconsistent and incomplete assessments. In response, we introduce MathEval, a comprehensive benchmark designed to methodically evaluate the mathematical problem-solving proficiency of LLMs across varied contexts, adaptation strategies, and evaluation metrics. MathEval amalgamates 19 datasets spanning an array of mathematical domains, languages, problem types, and difficulty levels, from elementary to advanced. This diverse collection facilitates a thorough appraisal of LLM performance and is stratified by language (English and Chinese), problem category (arithmetic, competitive mathematics, and higher mathematics), and difficulty. To overcome the challenges of standardizing responses across diverse models and prompts, we have developed an automated, LLM-driven pipeline for answer extraction and comparison, ensuring consistent evaluation criteria. To broaden the utility of MathEval beyond users with access to GPT-4, we have harnessed GPT-4's extensive evaluation results to train a DeepSeek-7B-based answer comparison model, enabling precise answer validation without GPT-4; this model will also be made publicly available. MathEval not only assesses mathematical proficiency but also introduces a method to identify potential data contamination in pre-training corpora. The underlying hypothesis is that genuine improvements on one mathematical dataset should be mirrored by gains on correlated datasets; isolated improvements therefore signal potential contamination, such as the inadvertent inclusion of test data in the pre-training phase. To mitigate this and truly gauge progress, MathEval incorporates an annually refreshed set of problems from the latest Chinese National College Entrance Examination (Gaokao 2023), thereby benchmarking genuine advances in mathematical problem-solving skills. MathEval thus strives to refine the assessment of LLMs' capabilities in mathematics.
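
As a rough illustration of the automated answer extraction and comparison step described in the abstract, the sketch below shows how a judge model (e.g., GPT-4 or the DeepSeek-7B comparator) might be wired into a two-stage extract-then-compare loop. This is a minimal sketch under stated assumptions: the prompts and the names `call_llm`, `evaluate_example`, and `accuracy` are illustrative and do not reproduce the paper's actual pipeline.

```python
# Minimal sketch of an LLM-driven answer extraction and comparison step.
# All prompts and names are illustrative assumptions, not the authors' implementation.

EXTRACT_PROMPT = (
    "Extract the final numeric or symbolic answer from the following "
    "model response. Reply with the answer only.\n\nResponse:\n{response}"
)

COMPARE_PROMPT = (
    "Question:\n{question}\n\nReference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Are the two answers mathematically equivalent? Reply YES or NO."
)


def evaluate_example(question: str, model_response: str, reference: str,
                     call_llm) -> bool:
    """Return True if the candidate answer matches the reference.

    `call_llm` is any callable that sends a prompt to a judge model
    (GPT-4 or a fine-tuned DeepSeek-7B comparator) and returns its text reply.
    """
    # Stage 1: normalize the free-form response into a bare answer string.
    candidate = call_llm(EXTRACT_PROMPT.format(response=model_response)).strip()
    # Stage 2: ask the judge whether the extracted answer matches the reference.
    verdict = call_llm(COMPARE_PROMPT.format(
        question=question, reference=reference, candidate=candidate)).strip()
    return verdict.upper().startswith("YES")


def accuracy(examples, call_llm) -> float:
    """Aggregate accuracy over a list of (question, response, reference) triples."""
    examples = list(examples)
    hits = sum(evaluate_example(q, r, ref, call_llm) for q, r, ref in examples)
    return hits / max(len(examples), 1)
```

Separating extraction from comparison keeps the evaluation criteria consistent across models whose output formats differ, which is the standardization problem the pipeline is meant to address.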
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6578