Abstract: This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models, including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants, we find that while newer models (e.g., o3-mini, DeepSeek-R1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers via flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and an inability to translate physical intuition into mathematical steps. Manual scrutiny reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite their broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just final answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLMs, mathematical reasoning, evaluation, reasoning failure
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2199