SMART: A Self-Validating Multi-Dimensional Assessment Framework for Evaluating LLMs’ Mathematical Problem-Solving Processes
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various mathematical benchmarks. However, concerns persist over whether these high scores reflect genuine mathematical capability or merely superficial pattern recognition. Furthermore, we contend that the commonly used metric of final answer accuracy fails to capture the nuanced performance of LLMs, as it reflects a composite outcome shaped by multiple underlying factors. This motivates us to introduce SMART (Self-Validating Multi-Dimensional Assessment Framework), which deconstructs the problem-solving process into four key dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Crucially, SMART does not evaluate based on final answer accuracy but instead designs separate tasks and evaluation methods for each dimension, enabling detailed and controllable assessments that decouple individual factors. Additionally, we propose a self-validating mechanism that iteratively generates and verifies test data, ensuring benchmark reliability and scalability. We evaluate 13 open-source and closed-source LLMs using SMART, and our findings reveal that final answer accuracy is insufficient for evaluating true mathematical problem-solving capabilities. Our analysis highlights symbolic reasoning and reflection & refinement as the key factors that distinguish LLM performance. We hope these insights will provide valuable guidance for advancing LLMs' true mathematical competence, and we will release our code and benchmark upon acceptance.
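The self-validating mechanism is described only at a high level in the abstract. The sketch below is a hypothetical illustration of what an iterative generate-then-verify loop could look like; the function names, the arithmetic item format, and the acceptance criterion are assumptions for illustration, not the authors' implementation.

```python
import random


def generate_item(rng: random.Random) -> dict:
    """Hypothetical generator: produce a simple arithmetic test item."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return {"question": f"What is {a} + {b}?", "answer": a + b}


def verify_item(item: dict) -> bool:
    """Hypothetical verifier: independently recompute the answer and
    accept the item only if it matches the stored ground truth."""
    operands = [int(tok.strip("?")) for tok in item["question"].split()
                if tok.strip("?").isdigit()]
    return sum(operands) == item["answer"]


def build_benchmark(n_items: int, seed: int = 0) -> list[dict]:
    """Iteratively generate and verify items until n_items pass validation,
    discarding any item the verifier rejects."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_items:
        item = generate_item(rng)
        if verify_item(item):
            accepted.append(item)
    return accepted


if __name__ == "__main__":
    print(build_benchmark(3))
```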
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies; benchmarking; metrics; automatic creation and evaluation of language resources
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 1007