Keywords: mathematical reasoning, benchmark
Abstract: With the continuous development of large language models (LLMs), their ability to solve mathematical problems has attracted increasing attention from the research community. Bubeck et al. report that GPT-5 can accelerate scientific progress and propose new ideas in interdisciplinary research. Additionally, Georgiev, Gómez-Serrano, Tao, and Wagner use large language models as powerful tools for scalable mathematical exploration, enabling conjecture generation and pattern discovery that complement human mathematical reasoning. While these recent advances highlight the strong potential of large language models in mathematical research, their capabilities on theoretical computer science (TCS) problems remain underexplored, motivating a benchmark that evaluates the understanding and reasoning abilities of large language models on TCS mathematical problems. In this work, we propose a comprehensive benchmark for four state-of-the-art models: GPT-5-Thinking, Grok-4, Gemini-3-Pro-Thinking, and Claude-Sonnet-4.5, based on the textbook Communication Complexity by Rao and Yehudayoff.
We have each model generate LaTeX-formatted proofs for the exercises and theorems in the book, and we assess the results through a combination of LLM-based evaluation and verification by human annotators. The results indicate that while the leading models (Gemini and Claude) achieve accuracies of approximately 92.0\%, the other models reach only around 22.7\%. Moreover, although cutting-edge models have attained a proficiency level suitable for graduate-level teaching assistance and mathematical formalization, substantial disparities persist in their reliability for rigorous mathematical derivations.
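To make the grading step concrete, the following is a minimal sketch of how an LLM-based evaluation of a generated proof could be wired up, assuming a prompt-and-verdict setup with borderline cases escalated to human annotators; the names here (query_model, ProofRecord, GRADER_PROMPT) are illustrative placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of LLM-based proof grading; not the authors' code.
from dataclasses import dataclass
from typing import Callable

GRADER_PROMPT = (
    "You are grading a LaTeX proof of a statement from Rao and Yehudayoff's "
    "Communication Complexity.\n"
    "Statement: {statement}\n\nCandidate proof:\n{proof}\n\n"
    "Reply CORRECT or INCORRECT on the first line, then list any gaps."
)

@dataclass
class ProofRecord:
    exercise_id: str
    statement: str    # theorem/exercise statement taken from the book
    proof_latex: str  # LaTeX proof generated by the evaluated model

def grade(record: ProofRecord, query_model: Callable[[str], str]) -> bool:
    """Ask a grader LLM for a verdict; ambiguous verdicts would be routed
    to human annotators for verification in the full pipeline."""
    reply = query_model(
        GRADER_PROMPT.format(statement=record.statement, proof=record.proof_latex)
    )
    return reply.strip().upper().startswith("CORRECT")
```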
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 2522