Keywords: large language model, mathematical reasoning, memorization, cheating
TL;DR: We introduce a challenging benchmark to evaluate LLMs' mathematical reasoning and code-writing abilities, finding that specialized models like o1-mini outperform earlier ones but still struggle overall.
Abstract: We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the Online Encyclopedia of Integer Sequences (OEIS), testing LLMs' abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as ``easy'' or ``hard.'' Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models' training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.
Croissant File: json
Dataset URL: https://github.com/ceodspspectrum/oeis-sequence-benchmark/
Supplementary Material: zip
Primary Area: Datasets & Benchmarks illustrating Different Deep learning Scenarios (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 2337
Loading