Abstract: While large language models (LLMs) have gained significant attention for mathematical problem-solving, many existing benchmarks require only shallow reasoning and have limited scope, hindering rigorous evaluation of their understanding of mathematical logic. To address this gap, we evaluate several LLMs on the more challenging Conic10K dataset, which focuses on conic section problems. Using code prompts, fine-tuning, and decoding strategies, we improve performance, boosting Qwen-72B from 20.2\% to 34.3\%. Notably, DeepSeek-R1 achieves a new state-of-the-art accuracy of 97.3\% with code prompting, up from 92.7\%, demonstrating that even high-performing models benefit from symbolic input when properly aligned. We also develop an automated verification system that independently processes 41.9\% of results, reducing human evaluation costs. Our results underscore the importance of structured symbolic prompting in enhancing mathematical reasoning and highlight the potential of code-based methods as a general framework for improving LLM performance on complex math tasks.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: Mathematical reasoning, Language Modeling, Large Language Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 1865