INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

Yuhuai Wu; Albert Jiang; Jimmy Ba; Roger Baker Grosse

Published: 12 Jan 2021, Last Modified: 22 Jun 2025, ICLR 2021 Poster, Readers: Everyone

Keywords: Theorem proving, Synthetic benchmark dataset, Generalization, Transformers, Graph neural networks, Monte Carlo Tree Search

Abstract: In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed to test agents’ generalization ability. INT is based on a theorem generator, which provides theoretically infinite data and allows us to measure 6 different types of generalization, each reflecting a distinct challenge characteristic of automated theorem proving. In addition, INT provides a fast theorem proving environment with sequence-based and graph-based interfaces, conducive to performing learning-based research. We introduce baselines for INT with architectures including transformers and graph neural networks (GNNs). Using INT, we find that transformer-based agents achieve stronger test performance for most of the generalization tasks, despite having much larger out-of-distribution generalization gaps than GNNs. We further find that the addition of Monte Carlo Tree Search (MCTS) at test time helps to prove new theorems.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

One-sentence Summary: We introduce INT, a synthetic INequality Theorem proving benchmark, to tackle the data sparsity and out-of-distribution problems in theorem proving, and we benchmark the generalization performance of transformer-based and GNN-based agents.

Supplementary Material: zip

Code: [albertqjiang/INT](https://github.com/albertqjiang/INT)

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/int-an-inequality-benchmark-for-evaluating/code)
