Keywords: mathematical reasoning, benchmark
Abstract: With the continuous development of large language models (LLMs), their ability to solve mathematical problems has attracted increasing attention from the research community. Bubeck et al. report that GPT-5 can accelerate scientific progress and propose new ideas in interdisciplinary research. Additionally, Georgiev, Gómez-Serrano, Tao, and Wagner use large language models as powerful tools for scalable mathematical exploration, enabling conjecture generation and pattern discovery that complement human mathematical reasoning. While these recent advances highlight the strong potential of large language models in mathematical research, their capabilities on theoretical computer science (TCS) problems remain underexplored, motivating a benchmark that evaluates the understanding and reasoning abilities of large language models on TCS mathematical problems. In this work, we propose a comprehensive benchmark for four state-of-the-art models: GPT-5-Thinking, Grok-4, Gemini-3-Pro-Thinking, and Claude-Sonnet-4.5, based on the textbook Communication Complexity by Rao and Yehudayoff.
We have each model generate LaTeX-formatted proofs for the exercises and theorems in the book, and we assess the results through a combination of LLM-based evaluation and verification by human annotators. The results indicate that while the leading models (Gemini and Claude) achieve accuracies of approximately 92.0\%, the other models reach only around 22.7\%. Moreover, although cutting-edge models have attained a proficiency level suitable for graduate-level teaching assistance and mathematical formalization, substantial disparities persist in their reliability for rigorous mathematical derivations.
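To make the grading step concrete, the following is a minimal sketch of how an LLM-based evaluation of a generated proof could be wired up, assuming a prompt-and-verdict setup with borderline cases escalated to human annotators; the names here (query_model, ProofRecord, GRADER_PROMPT) are illustrative placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of LLM-based proof grading; not the authors' code.
from dataclasses import dataclass
from typing import Callable

GRADER_PROMPT = (
    "You are grading a LaTeX proof of a statement from Rao and Yehudayoff's "
    "Communication Complexity.\n"
    "Statement: {statement}\n\nCandidate proof:\n{proof}\n\n"
    "Reply CORRECT or INCORRECT on the first line, then list any gaps."
)

@dataclass
class ProofRecord:
    exercise_id: str
    statement: str    # theorem/exercise statement taken from the book
    proof_latex: str  # LaTeX proof generated by the evaluated model

def grade(record: ProofRecord, query_model: Callable[[str], str]) -> bool:
    """Ask a grader LLM for a verdict; ambiguous verdicts would be routed
    to human annotators for verification in the full pipeline."""
    reply = query_model(
        GRADER_PROMPT.format(statement=record.statement, proof=record.proof_latex)
    )
    return reply.strip().upper().startswith("CORRECT")
```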
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 2522