PyraMathBench: A Comprehensive Framework for Evaluating and Improving Mathematical Capability in Large Language Models
Abstract: Despite the critical role of mathematical capabilities in large language models (LLMs) across various applications, few frameworks comprehensively evaluate these abilities from foundational to advanced levels. This gap hinders systematic exploration of the weaknesses in LLMs' mathematical abilities. In this paper, we introduce PyraMathBench\footnote{\url{https://osf.io/h4fwr/?view_only=392e23bd1b2443cd802b4c9ccef93dee}}, a framework designed to assess the mathematical capability of LLMs across four difficulty aspects, emphasizing the breakdown of complex tasks into simpler foundational components. PyraMathBench includes tailored single-modal and multimodal subtasks to rigorously evaluate model performance. We also propose the plug-and-play math model, a dynamic toolkit that enhances the mathematical processing abilities of LLMs, especially on Calculation-aspect tasks requiring intricate computation. Subsequent experiments with existing LLMs lead to the following findings: (i) LLMs' limited capacity for abstraction, task decomposition, and equation solving hinders their reasoning process. (ii) Multimodal LLMs (MLLMs) predominantly rely on textual information when solving Visual Reasoning problems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 6981