ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Authors: ICLR 2026 Conference Submission 22638 (anonymous)

Published: 20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: Large Language Model, Symbolic Mathematics, CAS, Tool Use, AI for Science, AI for Math, Integrals, Benchmark, Reasoning, Mathematical Capabilities, Evaluation
TL;DR: Assessing LLM mathematics skills with a new question dataset focused on symbolic computation: we show that the most advanced models go beyond memorizing patterns and exhibit signs of a deeper understanding of symbolic math.
Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present **ASyMOB**, a high-resolution dataset of **35,368** validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks, **ASyMOB** systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization and robustness. Our evaluation reveals three key findings: (1) most models’ performance collapses under minor perturbations, while frontier systems exhibit substantial robustness, suggesting an emerging *"phase transition"* from memorization to generalization; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. **ASyMOB** serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.
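The abstract's equivalence-preserving perturbations can be illustrated with a minimal sketch. This is not the authors' actual generation pipeline; the function `perturb`, the identity list `EQUIVALENT_ONES`, and the seed problem are all hypothetical, chosen only to show the idea of rewriting a problem into a symbolically equivalent but superficially different form.

```python
import random

# Hypothetical list of expressions that are symbolically equal to 1.
# Substituting one of them for a literal constant changes the surface
# form of a problem without changing its answer.
EQUIVALENT_ONES = [
    "(sin(x)**2 + cos(x)**2)",   # Pythagorean identity
    "(cosh(x)**2 - sinh(x)**2)", # hyperbolic identity
    "(exp(log(7))/7)",           # exp/log cancellation
]

def perturb(problem: str, rng: random.Random) -> str:
    """Replace the first literal '1' in a seed problem with a
    randomly chosen equivalent expression (illustrative sketch)."""
    replacement = rng.choice(EQUIVALENT_ONES)
    return problem.replace("1", replacement, 1)

# Hypothetical seed problem in SymPy-style notation.
seed = "integrate(1/(1 + x**2), x)"
print(perturb(seed, random.Random(0)))
```

A robustness benchmark in this spirit would check whether a model that solves the seed integral still solves each perturbed variant, since all variants share the same closed-form answer.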
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 22638