Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving

Evgenii Evstafev

Published: 09 Mar 2025, Last Modified: 10 Mar 2025, MathAI 2025 Oral, CC BY 4.0

Keywords: Large Language Models, Mathematical Reasoning, Code Generation, Benchmarking, Token-by-Token Regeneration, Computational Efficiency, MATH Dataset, Domain-Specific Performance, Automated Evaluation, Safety Constraints

Abstract: Large language models (LLMs) excel at many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly with symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters on 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The study introduces an evaluation framework that uses mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines how regenerating output token-by-token affects the results. The findings reveal a 34.5-percentage-point performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas such as Number Theory. For llama3.1:8b, token-by-token regeneration slightly improved accuracy (+0.8%) and also reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. Across all models, harder problems consistently correlated with lower accuracy. Despite the use of controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.

Submission Number: 39
