IsarStep: a Benchmark for High-level Mathematical Reasoning

Wenda Li; Lei Yu; Yuhuai Wu; Lawrence C. Paulson

IsarStep: a Benchmark for High-level Mathematical Reasoning

Wenda Li, Lei Yu, Yuhuai Wu, Lawrence C. Paulson

Published: 12 Jan 2021, Last Modified: 05 May 2023ICLR 2021 PosterReaders: Everyone

Keywords: mathematical reasoning, dataset, benchmark, reasoning, transformer

Abstract: A well-defined benchmark is essential for measuring and accelerating research progress of machine learning models. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has a broad coverage of undergraduate and research-level mathematical and computer science theorems. In our defined task, a model is required to fill in a missing intermediate proposition given surrounding proofs. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

One-sentence Summary: We present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models.

Supplementary Material: zip

Code: [![Papers with Code](/images/pwc_icon.svg) 2 community implementations](https://paperswithcode.com/paper/?openreview=Pzj6fzU6wkj)

16 Replies

Loading