Keywords: Theorem proving, Synthetic benchmark dataset, Generalization, Transformers, Graph neural networks, Monte Carlo Tree Search
Abstract: In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed to test agents’ generalization ability. INT is based on a theorem generator, which provides theoretically infinite data and allows us to measure six different types of generalization, each reflecting a distinct challenge characteristic of automated theorem proving. In addition, INT provides a fast theorem proving environment with sequence-based and graph-based interfaces, conducive to performing learning-based research. We introduce baselines for INT with architectures including transformers and graph neural networks (GNNs). Using INT, we find that transformer-based agents achieve stronger test performance on most of the generalization tasks, despite having much larger out-of-distribution generalization gaps than GNNs. We further find that the addition of Monte Carlo Tree Search (MCTS) at test time helps to prove new theorems.
One-sentence Summary: We introduce INT, a synthetic INequality Theorem proving benchmark, to tackle the data sparsity and out-of-distribution problems in theorem proving, and we benchmark the generalization performance of transformer-based and GNN-based agents.
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Code: [![github](/images/github_icon.svg) albertqjiang/INT](https://github.com/albertqjiang/INT)