SMART: A Self-Validating Multi-Dimensional Assessment Framework for Evaluating LLMs’ Mathematical Problem-Solving Processes

ACL ARR 2025 February Submission1007 Authors

12 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various mathematical benchmarks. However, concerns persist over whether these high scores reflect genuine mathematical capability or merely superficial pattern recognition. Furthermore, we contend that the commonly used metric of final answer accuracy fails to capture the nuances of LLM performance, as it reflects a composite outcome influenced by multiple underlying factors. This motivates us to introduce SMART (Self-Validating Multi-Dimensional Assessment Framework), which deconstructs the problem-solving process into four key dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Crucially, SMART does not evaluate based on final answer accuracy; instead, it designs separate tasks and evaluation methods for each dimension, enabling detailed and controllable assessments that decouple individual factors. Additionally, we propose a self-validating mechanism that iteratively generates and verifies test data, ensuring benchmark reliability and scalability. We evaluate 13 open-source and closed-source LLMs using SMART, and our findings reveal that final answer accuracy is insufficient for evaluating true mathematical problem-solving capabilities. Our analysis highlights symbolic reasoning and reflection & refinement as the key factors that distinguish LLM performance. We hope these insights will provide valuable guidance for advancing LLMs' true mathematical competence, and we will release our code and benchmark upon acceptance.
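The self-validating mechanism described above (iteratively generating test data and verifying it before inclusion) can be illustrated with a minimal sketch. This is not the authors' implementation; the functions `generate_item`, `verify_item`, and `build_benchmark` are hypothetical names, and a toy arithmetic generator with injected noise stands in for an LLM-based generation step:

```python
import random

def generate_item(rng):
    """Hypothetical generator: produce an arithmetic item with a proposed answer.
    A fraction of proposed answers are deliberately corrupted, standing in for
    generation noise that the verification pass must filter out."""
    a, b = rng.randint(1, 100), rng.randint(1, 100)
    proposed = a + b + (1 if rng.random() < 0.2 else 0)  # ~20% corrupted
    return {"question": f"{a} + {b}", "proposed_answer": proposed}

def verify_item(item):
    """Independent check: recompute the answer directly from the question text."""
    a, b = map(int, item["question"].split(" + "))
    return a + b == item["proposed_answer"]

def build_benchmark(n_items, seed=0, max_rounds=10):
    """Iterate generate-then-verify rounds until n_items validated items survive."""
    rng = random.Random(seed)
    validated = []
    for _ in range(max_rounds):
        if len(validated) >= n_items:
            break
        batch = [generate_item(rng) for _ in range(n_items)]
        validated.extend(item for item in batch if verify_item(item))
    return validated[:n_items]

items = build_benchmark(5)
```

The design point is that verification is independent of generation: only items whose proposed answers survive an external recomputation enter the benchmark, which is what makes the generate-verify loop self-validating.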
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies; benchmarking; metrics; automatic creation and evaluation of language resources
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 1007