Abstract: While large language models (LLMs) have gained significant attention for mathematical problem-solving, many existing benchmarks require only shallow reasoning and have limited scope, hindering rigorous evaluation of their understanding of mathematical logic. To address this gap, we evaluate several LLMs on the more challenging Conic10K dataset, which focuses on conic section problems. Using code prompts, fine-tuning, and decoding strategies, we improve performance, boosting Qwen-72B from 20.2\% to 34.3\%. Notably, DeepSeek-R1 achieves a new state-of-the-art accuracy of 97.3\% with code prompting, up from 92.7\%, demonstrating that even high-performing models benefit from symbolic input when properly aligned. We also develop an automated verification system that independently processes 41.9\% of results, reducing human evaluation costs. Our results underscore the importance of structured symbolic prompting in enhancing mathematical reasoning and highlight the potential of code-based methods as a general framework for improving LLM performance on complex math tasks.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: Mathematical reasoning, Language Modeling, Large Language Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 1865