LeanComb: A Combinatorial Identities Benchmark for Automated Theorem Proving

ICLR 2026 Conference Submission16005 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Automatic theorem generation, large language models, Lean proof assistants, combinatorics

Abstract: Automated theorem proving (ATP) in complex mathematical domains remains a fundamental challenge for large language models (LLMs), owing to the scarcity and imbalance of formalized training data. Combinatorics, with its discrete structures and symbolic reasoning, provides a demanding testbed for evaluating ATP capabilities. To address this data scarcity, we propose a comprehensive data-centric framework built on two essential components: \textsc{LeanComb}, a high-quality human-curated dataset, and \textsc{ATG4CI}, a novel method for automated theorem generation. \textsc{LeanComb} consists of combinatorial identities formalized in Lean 4. It covers eight fundamental areas of combinatorics, with training and test sets derived from the classical literature, enabling robust evaluation of cross-domain generalization. To overcome data sparsity, we develop a data augmentation framework, the \textbf{A}utomated \textbf{T}heorem \textbf{G}enerator for \textbf{C}ombinatorial \textbf{I}dentities (\textsc{ATG4CI}). It introduces a novel "Learn-from-Failure" paradigm that combines LLM-guided exploration with reinforcement-learning-driven search to systematically discover new theorems at the boundaries of models' reasoning capabilities. Applied to \textsc{LeanComb}, \textsc{ATG4CI} generates over 260K Lean-verifiable theorems, each with a complete proof. Fine-tuning models on the human-curated training set and on the augmented dataset yields average improvements of 4.0\% and 7.2\%, respectively, on the \textsc{LeanComb}-Test set. The fine-tuned models also achieve promising performance on the challenging ATP benchmarks PutnamBench and CombiBench, demonstrating the effectiveness of our approach.
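
For readers unfamiliar with what a formalized combinatorial identity looks like, the following is a minimal illustrative sketch in Lean 4. It is not taken from \textsc{LeanComb}; the theorem name `binom_row_sum` is hypothetical, and the example assumes a recent Mathlib, which provides the lemma `Nat.sum_range_choose` and the `∑ k ∈ s, f k` notation.

```lean
import Mathlib

open Finset

-- Illustrative only (hypothetical name, not an entry from the benchmark):
-- the binomial coefficients in row n sum to 2^n.
theorem binom_row_sum (n : ℕ) :
    ∑ k ∈ range (n + 1), n.choose k = 2 ^ n :=
  Nat.sum_range_choose n  -- existing Mathlib lemma
```

An entry of this kind pairs a formal statement with a proof that the Lean kernel checks, which is the sense in which a generated theorem can be called Lean-verifiable.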

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Submission Number: 16005
