Keywords: benchmark, combinatorics, neural theorem proving, formal methods, large language models, AI for Math, formal reasoning
Abstract: Neurosymbolic approaches that integrate large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry, and number theory. By contrast, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive benchmark comprising 100 combinatorial competition problems, each formalized in Lean 4 and paired with its corresponding informal statement. The problems cover a wide spectrum of difficulty levels, ranging from middle school to IMO and university level, and span more than ten combinatorial topics.
Furthermore, we provide a comprehensive and standardized evaluation framework for formal mathematics. It accommodates not only proof-based problems but also, for the first time, the evaluation of fill-in-the-blank questions. We open-source the benchmark dataset together with the code of the proposed evaluation framework.
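To illustrate what a fill-in-the-blank item might look like in Lean 4, a minimal sketch is given below. The names (`answer`, `problem`) and the specific problem are illustrative assumptions, not the benchmark's actual encoding; the idea is that a solver must replace the first `sorry` with a closed-form value and the second with a proof that the value satisfies the stated condition.

```lean
import Mathlib

-- Hypothetical fill-in-the-blank item (illustrative only, not from CombiBench):
-- "How many subsets does a set with 3 elements have?"

/-- The blank to be filled in: the solver replaces `sorry` with a numeral. -/
def answer : ℕ := sorry

/-- The formal statement tying the blank to the problem's condition;
    the solver must also supply this proof. -/
theorem problem (s : Finset ℕ) (h : s.card = 3) :
    s.powerset.card = answer := by
  sorry
```

Under this hypothetical encoding, a correct submission would set `answer := 8` and close the proof using `Finset.card_powerset` together with the hypothesis `h`.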
Primary Area: datasets and benchmarks
Submission Number: 11422