CC-Eval: A Benchmark for PhD-Level Mathematical Reasoning in Communication Complexity

ACL ARR 2026 January Submission 2799 Authors

03 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: mathematical reasoning, benchmark
Abstract: Large language models (LLMs) have recently shown promising abilities in mathematical reasoning, yet their performance on foundational graduate-level theory remains insufficiently understood. In this work, we present CC-Eval, a benchmark for assessing frontier LLMs on material from Communication Complexity by Kushilevitz and Nisan, a foundational textbook in theoretical computer science. We evaluate four state-of-the-art models by requiring them to generate complete LaTeX proofs for a diverse set of textbook lemmas and exercises. The results reveal substantial variation in performance: the strongest model achieves an overall accuracy of 78.4\%, while the weakest reaches only 49.5\%, with the other models falling between these extremes. Beyond aggregate accuracy, we conduct a qualitative analysis of the generated proofs, highlighting systematic differences in logical structure, conciseness, and failure modes. These findings indicate that while current frontier models can assist with graduate-level learning and partial formalization in communication complexity, their reliability for producing fully rigorous proofs remains uneven. This benchmark provides a focused assessment of the strengths and limitations of LLMs on core theoretical computer science tasks.
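To make the task concrete, the LaTeX sketch below illustrates the kind of textbook-style lemma a model would be asked to prove. The specific lemma shown (the log-rank lower bound on deterministic communication, a standard result covered in Kushilevitz and Nisan) and the prompt formatting are illustrative assumptions, not items quoted from the benchmark itself.

% Hypothetical prompt format; the lemma choice and environment names are
% illustrative assumptions, not taken from the benchmark release.
\documentclass{article}
\usepackage{amsmath,amsthm}
\newtheorem{lemma}{Lemma}
\begin{document}
\begin{lemma}[Log-rank lower bound]
  For every function $f \colon X \times Y \to \{0,1\}$ with communication matrix $M_f$,
  the deterministic communication complexity satisfies
  $D(f) \ge \log_2 \operatorname{rank}(M_f)$,
  where the rank is taken over the reals.
\end{lemma}
% The model is asked to supply a complete proof at this point, e.g.
% \begin{proof} ... \end{proof}
\end{document}

A generated proof would typically argue that a deterministic protocol with $D(f)$ bits partitions $M_f$ into at most $2^{D(f)}$ monochromatic rectangles, each of rank at most one, so the ranks sum to at least $\operatorname{rank}(M_f)$.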
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmark, evaluation
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 2799