PyraMathBench: A Comprehensive Framework for Evaluating and Improving Mathematical Capability in Large Language Models
Abstract: Despite the critical role of mathematical capabilities in large language models (LLMs) across various applications, few frameworks comprehensively evaluate these abilities from foundational to advanced levels. This gap hinders systematic exploration of the weaknesses in LLMs' mathematical abilities. In this paper, we introduce PyraMathBench\footnote{\url{https://osf.io/h4fwr/?view_only=392e23bd1b2443cd802b4c9ccef93dee}}, a framework designed to assess the mathematical capability of LLMs across four difficulty aspects, emphasizing the breakdown of complex tasks into simpler foundational components. PyraMathBench includes tailored single-modal and multimodal subtasks to rigorously evaluate model performance. We also propose the plug-and-play math model, a dynamic toolkit that enhances the mathematical processing abilities of LLMs, especially on Calculation-aspect tasks requiring intricate computation. Subsequent experiments with existing LLMs lead to the following findings: (i) LLMs' limited capacity for abstraction, task decomposition, and equation solving hinders their reasoning process. (ii) Multimodal LLMs (MLLMs) predominantly rely on textual information when solving Visual Reasoning problems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 6981